REPORT  DOCUMENTATION  PAGE 

Form Approved

OMB No. 0704-0188

Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503.

1. AGENCY USE ONLY (Leave Blank)

2. REPORT DATE

Sept. 9, 1999

3. REPORT TYPE AND DATES COVERED

Final, Nov. 1, 1998 to Sept. 30, 1999

4.  TITLE  AND  SUBTITLE 

A  1999  Workshop  on  Heterogeneous  Computing 

5.  FUNDING  NUMBERS 

N00014-99-1-0117

6.  AUTHOR(S) 

H.  J.  Siegel 

7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

School  of  Electrical  and  Computer  Engineering 

Purdue  University 

West  Lafayette,  IN  47907-1285 

8. PERFORMING ORGANIZATION REPORT NUMBER

9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES)

Dr.  Andre  M.  van  Tilborg,  Director 

Math,  Computer  &  Information  Sciences  Division 

Office  of  Naval  Research 
Arlington, VA 22217-5660

10. SPONSORING / MONITORING AGENCY REPORT NUMBER

11.  SUPPLEMENTARY  NOTES 


12a. DISTRIBUTION / AVAILABILITY STATEMENT

Unlimited

12b. DISTRIBUTION CODE

13. ABSTRACT (Maximum 200 words)

This grant funded the proceedings of the 8th Heterogeneous Computing Workshop (HCW ’99), which was held on April 12, 1999. HCW ’99 was part of the merged symposium of the 13th International Parallel Processing Symposium and the 10th Symposium on Parallel and Distributed Processing (IPPS/SPDP 1999), which was sponsored by the IEEE Computer Society Technical Committee on Parallel Processing and held in cooperation with ACM SIGARCH. Heterogeneous computing systems range from diverse elements within a single computer to coordinated, geographically distributed machines with different architectures. A heterogeneous computing system provides a variety of capabilities that can be orchestrated to execute multiple tasks with varied computational requirements. Applications in these environments achieve performance by exploiting the affinity of different tasks to different computational platforms or paradigms, while considering the overhead of inter-task communication and the coordination of distinct data sources and/or administrative domains. Topics representative of those in the proceedings include: network profiling, configuration tools, scheduling tools, analytic benchmarking, programming paradigms, problem mapping, processor assignment and scheduling, fault tolerance, programming tools, processor selection criteria, and compiler assistance.


14. SUBJECT TERMS

heterogeneous computing, distributed computing, high-performance computing

15.  NUMBER  OF  PAGES 

1 

16.  PRICE  CODE 

17. SECURITY CLASSIFICATION OF REPORT: UNCLASSIFIED

18. SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED

19. SECURITY CLASSIFICATION OF ABSTRACT: UNCLASSIFIED

20. LIMITATION OF ABSTRACT: UNLIMITED

NSN  7540-01-280-5500 


Standard  Form  298  (Rev.  2-89) 
Prescribed by ANSI Std. Z39-18
298-102 


PURDUE  UNIVERSITY 


School  of  Electrical  and  Computer  Engineering 
1285  Electrical  Engineering  Building 
West Lafayette, Indiana 47907-1285, USA
E-mail:  hj@purdue.edu 


September  9, 1999 


Prof.  H.  J.  Siegel 
Office  Phone:  765-494-3444 
Office  Fax:  765-494-2706 
Home  Phone:  765-743-3290 


Defense  Technical  Information  Center 
8725  John  J.  Kingman  Road 
STE 0944

Ft.  Belvoir,  Virginia  22060-6218 


Enclosed is the final report (that is, form SF-298) for ONR grant number N00014-99-1-0117, which supported the publication of the enclosed workshop proceedings.

ONR’s  support  is  greatly  appreciated. 


Yours truly,


H.  J.  Siegel 
Professor of Electrical and Computer Engineering


cc:  Dr.  Andre  van  Tilborg,  ONR 
Grant  Administrator,  ONR 
Purdue  ECE  Business  Office 


Proceedings 


Eighth  Heterogeneous 
Computing  Workshop 
(HCW  ’99) 



April  12,  1999 
San  Juan,  Puerto  Rico 


Edited  by 

Viktor  K.  Prasanna,  University  of  Southern  California 

Cosponsored  by 

IEEE  Computer  Society's  Technical  Committee  on  Parallel  Processing 

U.S.  Office  of  Naval  Research 


Industrial  Affiliate 

NOEMIX 


IEEE Computer Society


Los  Alamitos,  California 
Washington  •  Brussels  •  Tokyo 




Copyright  ©  1999  by  The  Institute  of  Electrical  and  Electronics  Engineers,  Inc. 

All  rights  reserved 


Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may
photocopy  beyond  the  limits  of  US  copyright  law,  for  private  use  of  patrons,  those  articles  in  this  volume  that 
carry  a  code  at  the  bottom  of  the  first  page,  provided  that  the  per-copy  fee  indicated  in  the  code  is  paid  through  the 
Copyright  Clearance  Center,  222  Rosewood  Drive,  Danvers,  MA  01923. 

Other  copying,  reprint,  or  republication  requests  should  be  addressed  to:  IEEE  Copyrights  Manager,  IEEE 
Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.

The  papers  in  this  book  comprise  the  proceedings  of  the  meeting  mentioned  on  the  cover  and  title  page.  They 
reflect the authors’ opinions and, in the interests of timely dissemination, are published as presented and
without  change.  Their  inclusion  in  this  publication  does  not  necessarily  constitute  endorsement  by  the  editors, 
the  IEEE  Computer  Society,  or  the  Institute  of  Electrical  and  Electronics  Engineers,  Inc. 


IEEE  Computer  Society  Order  Number  PR00107 
ISBN  0-7695-0107-9 
0-7695-0108-7  (microfiche) 
0-7695-0109-5  (casebound) 

ISSN  1097-5209 


Additional  copies  may  be  ordered  from: 


IEEE  Computer  Society 
Customer  Service  Center 
10662  Los  Vaqueros  Circle 
P.O.  Box  3014 

Los  Alamitos,  CA  90720-1314 
Tel: + 1-714-821-8380
Fax:  +  1-714-821-4641 
E-mail:  cs.books@computer.org 


IEEE  Service  Center 
445  Hoes  Lane 
P.O.  Box  1331 
Piscataway,  NJ  08855-1331 
Tel:  +  1-908-981-1393 
Fax:  +  1-908-981-9667 
mis.custserv@computer.org 


IEEE  Computer  Society 
Asia/Pacific  Office 
Watanabe  Bldg.,  1-4-2 
Minami-Aoyama 
Minato-ku, Tokyo 107-0062
JAPAN 

Tel:  +  81-3-3408-3118 
Fax:  +  81-3-3408-3553 
tokyo.ofc@computer.org 


Editorial  production  by  Lorretta  Palagi 
Cover  art  design  and  production  by  Alex  Torres 
Printed  in  the  United  States  of  America  by  Technical  Communication  Services 




Table  of  Contents 


Message  from  the  General  Chair . vii 

Message from the Program Chair . viii

Message from the Steering Committee Chair . ix

Organizing  Committees . x 

Session  I:  Comparisons  of  Mapping  Heuristics 

Chair:  Jon  Weissman,  University  of  Texas  at  San  Antonio,  TX,  USA 

Task  Scheduling  Algorithms  for  Heterogeneous  Processors . 3 

Haluk  Topcuoglu,  Salim  Hariri,  and  Min-You  Wu 

A  Comparison  Study  of  Static  Mapping  Heuristics  for  a  Class  of  Meta-tasks  on  Heterogeneous 

Computing  Systems . 15 

Tracy D. Braun, Howard Jay Siegel, Noah Beck, Ladislau L. Bölöni,

Muthucumaru Maheswaran, Albert I. Reuther, James P. Robertson, Mitchell D. Theys,

Bin Yao, Debra Hensgen, and Richard F. Freund

Dynamic  Matching  and  Scheduling  of  a  Class  of  Independent  Tasks  onto  Heterogeneous 

Computing Systems . 30

Muthucumaru Maheswaran, Shoukat Ali, Howard Jay Siegel, Debra Hensgen, and
Richard F. Freund

Session  II:  Design  Tools 

Chair:  Ishfaq  Ahmad,  Hong  Kong  University  of  Science  and  Technology,  Hong  Kong 

An  On-Line  Performance  Visualization  Technology . 47 

Aleksandar  Bakic,  Matt  W.  Mutka,  and  Diane  T.  Rover 

Heterogeneous  Distributed  Virtual  Machines  in  the  Harness  Metacomputing  Framework . 60 

Mauro  Migliardi  and  Vaidy  Sunderam 

Parallel  C++  Programming  System  on  Cluster  of  Heterogeneous  Computers . 73 

Yutaka  Ishikawa,  Atsushi  Hori,  Hiroshi  Tezuka,  Shinji  Sumimoto,  Toshiyuki  Takahashi, 
and  Hiroshi  Harada 

Are  CORBA  Services  Ready  to  Support  Resource  Management  Middleware  for  Heterogeneous 

Computing? . 83 

Alpay  Duman,  Debra  Hensgen,  David  St.  John,  and  Taylor  Kidd 

Session  III:  Modeling  and  Analysis 

Chair:  Steve  Chapin,  University  of  Virginia,  Charlottesville,  VA,  USA 

Statistical  Prediction  of  Task  Execution  Times  Through  Analytic  Benchmarking  for 

Scheduling  in  a  Heterogeneous  Environment . 99 

Michael A. Iverson, Füsun Özgüner, and Lee C. Potter

Simulation  of  Task  Graph  Systems  in  Heterogeneous  Computing  Environments . 112 

Noe Lopez-Benitez and Ja-Young Hyon

Communication  Modeling  of  Heterogeneous  Networks  of  Workstations  for  Performance 

Characterization  of  Collective  Operations . 125 

Mohammad  Banikazemi,  Jayanthi  Sampathkumar,  Sandeep  Prabhu,  Dhabaleswar  K.  Panda, 
and  P.  Sadayappan 


Session IV: Task Assignment and Scheduling

Chair: Füsun Özgüner, The Ohio State University, Columbus, OH

Multiple Cost Optimization for Task Assignment in Heterogeneous Computing

Systems Using Learning Automata . 137

Raju D. Venkataramana and N. Ranganathan

On the Robustness of Metaprogram Schedules . 146

Ladislau Bölöni and Dan C. Marinescu

A Unified Resource Scheduling Framework for Heterogeneous Computing Environments . 156

Ammar H. Alhusaini, Viktor K. Prasanna, and C. S. Raghavendra

Session V: Invited Case Studies

Chair: Noe Lopez-Benitez, Texas Tech University, Lubbock, TX, USA

Metacomputing with MILAN . 169

A. Baratloo, P. Dasgupta, V. Karamcheti, and Z. M. Kedem

An Overview of MSHN: The Management System for Heterogeneous Networks . 184

Debra A. Hensgen, Taylor Kidd, David St. John, Matthew C. Schnaidt, Howard Jay Siegel,
Tracy D. Braun, Muthucumaru Maheswaran, Shoukat Ali, Jong-Kook Kim, Cynthia Irvine,
Tim Levin, Richard F. Freund, Matt Kussow, Michael Godfrey, Alpay Duman, Paul Carff,
Shirley Kidd, Viktor Prasanna, Prashanth Bhat, and Ammar Alhusaini

QUIC: A Quality of Service Network Interface Layer for Communication in NOWs . 199

R. West, R. Krishnamurthy, W. K. Norton, K. Schwan, S. Yalamanchili, M. Rosu, and V. Sarat

Adaptive Distributed Applications on Heterogeneous Networks . 209

Thomas Gross, Peter Steenkiste, and Jaspal Subhlok

Author Index . 219


Message  from  the  General  Chair 


Welcome  to  the  8th  Heterogeneous  Computing  Workshop.  The  field  of  heterogeneous  computing  is 
motivated  by  the  diverse  requirements  of  large-scale  computational  tasks,  and  the  realization  that  the 
features  of  a  single  architecture  are  not  always  ideal  for  a  wide  range  of  task  requirements. 

HCW  ’99  is  the  result  of  the  dedication  and  hard  work  of  a  number  of  people.  I  thank  Richard  F.  Freund 
of NOEMIX for founding this series of workshops and for working hard to ensure its continuity and
success.  Special  thanks  also  go  to  our  industrial  supporter,  NOEMIX,  for  providing  plaques  of  recognition 
to  be  awarded  to  individuals  who  have  contributed  to  the  workshop’s  success  over  the  years. 

Viktor  K.  Prasanna  of  the  University  of  Southern  California  is  this  year’s  Program  Chair.  I  have  worked 
with  Viktor  on  professional  activities  before,  and  as  always  he  went  above  and  beyond  the  call  of  duty. 
In  addition  to  the  Program  Chair’s  tasks,  he  also  took  responsibility  for  completing  tasks  that  I  probably 
should  have  taken  on  in  my  role  as  General  Chair.  So  I  owe  a  special  thank  you  to  Viktor.  With  the  able 
assistance  of  a  terrific  program  committee,  he  has  put  together  an  excellent  program  and  collection  of 
papers  in  these  proceedings. 

Thanks  are  also  due  to  the  Steering  Committee  members  for  their  guidance  and  support,  and  for  the 
confidence  they  had  in  asking  me  to  serve  as  this  year’s  General  Chair.  Special  thanks  go  to  H.  J.  Siegel, 
the  Steering  Committee  Chair,  who  did  a  remarkable  job  of  leading  that  group  and  conveying  ideas  to  me 
for  enhancing  the  quality  and  prestige  of  the  workshop.  H.  J.  worked  with  Viktor  and  myself  on 
numerous  occasions  with  regard  to  planning  and  decision-making  issues.  H.  J.’s  advice,  energy,  and 
dedication  to  this  workshop  series  are  truly  keys  to  its  overall  success. 

The  Publicity  Chair,  Muthucumaru  Maheswaran  of  the  University  of  Manitoba,  did  an  outstanding  job  of 
publicizing the workshop through print and on the web. His careful and prompt updating of the
workshop’s web page was especially useful in keeping the authors informed on guidelines for final
submission. 

HCW  ’99  is  being  held  in  conjunction  with  IPPS/SPDP  ’99,  the  second  merger  of  the  International 
Parallel  Processing  Symposium  (IPPS)  and  the  Symposium  on  Parallel  and  Distributed  Processing 
(SPDP). I thank the General Co-Chairs of IPPS/SPDP ’99, Jose Rolim and Charles Weems, for their
cooperation  and  assistance,  with  special  thanks  to  Jose  for  taking  on  the  responsibility  of  coordinating 
and  organizing  the  workshops  of  IPPS/SPDP  ’99. 

This  year,  the  workshop  is  cosponsored  by  the  IEEE  Computer  Society  and  the  U.S.  Office  of  Naval 
Research.  These  proceedings  are  published  by  the  IEEE  Computer  Society  Press.  Deborah  Plummer  and 
Lorretta  Palagi,  both  of  the  IEEE  Computer  Society  Press,  deserve  special  thanks  for  their  punctuality 
and  professionalism  in  overseeing  the  publication  of  these  proceedings.  Special  thanks  go  to  Lorretta  for 
her  efficient  handling  of  the  papers  included  here,  and  for  carefully  attending  to  the  many  details  required 
to  take  a  proceedings  to  press. 

I  would  also  like  to  thank  my  secretary,  Marcelia  Sawyers,  for  her  assistance  with  my  duties  related  to 
this  workshop.  Finally,  I  would  like  to  thank  my  wife,  Robin,  for  the  loving  support  and  patience  she  has 
for  me. 

John  K.  Antonio 
Texas Tech University

Message from the Program Chair


The  papers  published  in  these  proceedings  represent  some  of  the  results  from  leading  researchers  in 
heterogeneous  computing  (HC).  The  field  of  heterogeneous  computing  has  matured  over  the  years.  A 
number of experimental as well as commercial systems continue to be built that integrate hardware,
software,  and  algorithms  to  realize  high-performance  systems  that  satisfy  diverse  computational  needs. 

The  response  to  the  call  for  participation  was  excellent.  Submissions  were  sent  out  to  the  program 
committee  members  for  evaluation.  In  addition  to  their  own  reviews,  the  program  committee  members 
sought  outside  reviews  to  evaluate  the  submissions.  The  final  selection  of  manuscripts  was  made  at  USC 
on November 24, 1998. The contributed papers were grouped into four sessions: Comparisons of
Mapping Heuristics, Design Tools, Modeling and Analysis, and Task Assignment and Scheduling. In
addition  to  contributed  papers,  the  program  includes  a  session  of  Invited  Case  Studies.  I  believe  the  papers 
represent  continuing  work  as  the  field  matures  and  I  expect  to  see  revised  versions  of  these  papers  appear 
in  archival  journals. 

I would like to thank many volunteers for their support. First of all, I want to thank John Antonio,
General  Chair,  H.  J.  Siegel,  Steering  Committee  Chair,  and  Richard  Freund,  who  initiated  the  HCW  series, 
for  inviting  me  to  be  the  program  chair.  Over  the  past  year,  John  and  H.  J.  provided  me  with  a  number  of 
pointers  to  resolve  meeting-related  issues.  I  want  to  thank  them  for  their  invaluable  inputs  in  composing 
a  strong  technical  program.  It  was  truly  a  pleasure  working  with  them. 

I  would  like  to  thank  the  authors  for  submitting  their  work  and  the  program  committee  members  and  the 
reviewers  for  their  efforts  in  reviewing  the  manuscripts.  I  would  also  like  to  thank  Lorretta  Palagi  for  her 
patience in working with late camera-ready submissions and for her prompt response to
proceedings-related questions.

Finally,  I  am  thankful  to  my  assistant  Henryk  Chrostek  who  handled  the  submitted  manuscripts  in  a 
timely  manner. 

Viktor  K.  Prasanna 
University of Southern California

Message from the Steering Committee Chair


These  are  the  proceedings  of  the  8th  Heterogeneous  Computing  Workshop,  also  known  as  HCW  ’99. 
Heterogeneous  computing  is  a  very  important  research  area  with  great  practical  impact.  The  topic  of 
heterogeneous  computing  covers  many  types  of  systems.  A  heterogeneous  system  may  be  a  set  of 
machines  interconnected  by  a  wide-area  network  and  used  to  support  the  execution  of  jobs  submitted  by  a 
large  variety  of  users  to  process  data  that  is  distributed  throughout  the  system.  A  heterogeneous  system 
may  be  a  suite  of  high-performance  machines  tightly  interconnected  by  a  fast  dedicated  local-area 
network  and  used  to  process  a  set  of  production  tasks,  where  the  subtasks  of  each  task  may  execute  on 
different  machines  in  the  suite.  A  heterogeneous  system  may  also  be  a  special-purpose  embedded  system, 
such  as  a  set  of  different  types  of  processors  used  for  automatic  target  recognition.  In  the  extreme,  a 
heterogeneous  system  may  consist  of  a  single  machine  that  can  reconfigure  itself  to  operate  in  different 
ways  (e.g.,  in  different  modes  of  parallelism).  All  of  these  types  of  heterogeneous  systems  (as  well  as 
others)  are  appropriate  topics  for  this  workshop  series.  I  hope  you  find  the  contents  of  these  proceedings 
informative  and  interesting,  and  encourage  you  to  look  also  at  the  proceedings  of  past  and  future  HCWs. 

Many  people  have  worked  very  hard  to  make  this  workshop  happen.  Viktor  Prasanna,  University  of 
Southern  California,  is  this  year’s  Program  Chair,  and  he  assembled  the  great  program  that  is  represented 
by  the  papers  in  these  proceedings.  Viktor  did  this  with  the  assistance  of  his  Program  Committee,  which 
is  listed  on  the  next  page.  John  Antonio,  Texas  Tech  University,  is  the  General  Chair,  and  he  is 
responsible  for  the  overall  organization  and  administration  of  this  year’s  workshop,  and  he’s  done  an 
outstanding job. I thank Richard F. Freund, NOEMIX, for founding this workshop series, and for asking
me  to  succeed  him  as  Chair  of  the  Steering  Committee. 

This  year  the  workshop  is  cosponsored  by  the  IEEE  Computer  Society  and  the  U.S.  Office  of  Naval 
Research,  with  additional  support  from  our  industrial  affiliate  NOEMIX.  I  thank  Andre  M.  van  Tilborg, 
Director  of  the  Math,  Computer,  &  Information  Sciences  Division  of  the  Office  of  Naval  Research,  for 
arranging funding for the publication of the workshop proceedings (under grant number
N00014-99-1-0117). I thank Richard F. Freund, NOEMIX, for providing the plaque given to Viktor in recognition of
his  efforts  as  Program  Chair. 

This  workshop  is  held  in  conjunction  with  the  Merged  International  Parallel  Processing  Symposium  & 
Symposium  on  Parallel  and  Distributed  Processing  (IPPS/SPDP).  The  HCW  series  is  very  appreciative  of 
the  constant  cooperation  and  assistance  we  have  received  from  the  IPPS/SPDP  organizers. 

H.  J.  Siegel 

School  of  Electrical  and  Computer  Engineering 
Purdue University

Abstract

Scheduling computation tasks on processors is the key issue for high-performance computing. Although a large number of scheduling heuristics have been presented in the literature, most of them target only homogeneous resources. The existing algorithms for heterogeneous domains are not generally efficient because of their high complexity and/or the quality of the results. We present two low-complexity efficient heuristics, the Heterogeneous Earliest-Finish-Time (HEFT) Algorithm and the Critical-Path-on-a-Processor (CPOP) Algorithm, for scheduling directed acyclic weighted task graphs (DAGs) on a bounded number of heterogeneous processors. We compared the performances of these algorithms against three previously proposed heuristics. The comparison study showed that our algorithms outperform previous approaches in terms of performance (schedule length ratio and speedup) and cost (time complexity).


1.  Introduction 

Efficient scheduling of application tasks is critical to achieving high performance in parallel and distributed systems. The objective of scheduling is to map the tasks onto the processors and order their execution so that task precedence requirements are satisfied and a minimum schedule length (or makespan) is obtained. Since the general DAG scheduling problem is NP-complete, many research efforts have proposed heuristics for the task scheduling problem [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13].

Although a wide variety of approaches are used to solve the DAG scheduling problem, most of them target only homogeneous processors. The scheduling techniques that are suitable for homogeneous domains are limited and may not be suitable for heterogeneous domains. Only a few methods [5, 4, 9, 10] use variable execution times of tasks for heterogeneous environments; however, they are high-complexity algorithms and/or they do not generally provide good-quality results.

In this paper we propose two static DAG scheduling algorithms for heterogeneous environments. They are for a bounded number of processors and are based on list-scheduling heuristics. The Heterogeneous Earliest-Finish-Time (HEFT) Algorithm selects the task with the highest upward rank (defined in Section 2) at each step; then the task is assigned to the most suitable processor, the one that minimizes the earliest finish time, using an insertion-based approach. The Critical-Path-on-a-Processor (CPOP) Algorithm schedules critical-path nodes onto a single processor that minimizes the critical path length. For the other nodes, the task selection phase of the algorithm is based on a summation of downward and upward ranks; the processor selection phase is based on the earliest execution finish time, as in the HEFT Algorithm. The simulation study in Section 5 shows that our algorithms considerably outperform previous approaches in terms of performance (schedule length ratio and speedup) and cost (time complexity).

0-7695-0107-9/99 $10.00 © 1999 IEEE

The remainder of this paper is organized as follows. The next section gives the background of the scheduling problem, including some definitions and parameters used in the algorithms. In Section 3 we present the proposed scheduling algorithms for heterogeneous domains. Section 4 contains a brief review of the related scheduling algorithms that will be used in our comparison, and in Section 5 the performances of our algorithms are compared with the performances of related work, using task graphs of some real applications and randomly generated task graphs. Section 6 includes the conclusion and future work.

2.  Problem  Definition 

A parallel/distributed application is decomposed into multiple tasks with data dependencies among them. In our model an application is represented by a directed acyclic graph (DAG) that consists of a tuple $G = (V, E, P, W, data, rate)$, where $V$ is the set of $v$ nodes/tasks, $E$ is the set of $e$ edges between the nodes, and $P$ is the set of processors available in the system. (In this paper the terms task and node are used interchangeably.) Each edge $(i, j) \in E$ represents the task-dependency constraint such that task $n_i$ should complete its execution before task $n_j$ can be started.

$W$ is a $v \times p$ computation cost matrix, where $v$ is the number of tasks and $p$ is the number of processors in the system. Each $w_{i,j}$ gives the estimated execution time to complete task $n_i$ on processor $p_j$. The average execution costs of tasks are used in the task priority equations. The average execution cost of a node $n_i$ is defined as $\overline{w_i} = \sum_{j=1}^{p} w_{i,j} / p$. $Data$ is a $v \times v$ matrix for data transfer size (in bytes) between the tasks. The data transfer rates (in bytes/second) between processors are stored in a $p \times p$ matrix, $rate$.

The communication cost of the edge $(i, j)$, which is for data transfer from task $n_i$ (scheduled on $p_m$) to task $n_j$ (scheduled on $p_n$), is defined by $c_{i,j} = data(n_i, n_j) / rate(p_m, p_n)$. When both $n_i$ and $n_j$ are scheduled on the same processor, $p_m = p_n$, then $c_{i,j}$ becomes zero, since the intra-processor communication cost is negligible compared with the interprocessor communication cost. The average communication cost of an edge is defined by $\overline{c_{i,j}} = data(n_i, n_j) / \overline{rate}$, where $\overline{rate}$ is the average transfer rate between the processors in the domain.
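As a minimal sketch of these cost definitions (not taken from the paper), the following Python computes the average execution cost of a task and the actual and average communication cost of an edge. The values in `W`, `data`, and `rate` are made up for illustration, and averaging the rate over distinct processor pairs only is an assumption; the text says only that it is the average transfer rate between the processors.

```python
from statistics import mean

# Illustrative values only (not from the paper): 3 tasks, 2 processors.
W = [[14.0, 16.0],   # W[i][j]: execution time of task i on processor j
     [13.0, 19.0],
     [11.0, 13.0]]
data = {(0, 1): 120.0, (0, 2): 60.0}   # bytes sent along edge (i, j)
rate = [[0.0, 10.0],                   # rate[m][n]: bytes/second from p_m to p_n
        [10.0, 0.0]]                   # diagonal entries are never used

def avg_exec_cost(W, i):
    """Mean execution time of task i over all processors."""
    return mean(W[i])

def comm_cost(data, rate, i, j, m, n):
    """Cost of edge (i, j) when task i runs on p_m and task j on p_n."""
    if m == n:
        return 0.0   # intra-processor communication is treated as free
    return data[(i, j)] / rate[m][n]

def avg_rate(rate):
    """Assumed definition: mean rate over distinct processor pairs."""
    p = len(rate)
    return mean(rate[m][n] for m in range(p) for n in range(p) if m != n)

def avg_comm_cost(data, rate, i, j):
    """Average cost of edge (i, j), independent of any placement."""
    return data[(i, j)] / avg_rate(rate)
```

The average costs depend only on the graph and the machine suite, not on any placement, which is why they can be computed once before scheduling starts.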


$EST(n_i, p_j)$ and $EFT(n_i, p_j)$ are the earliest execution start time and the earliest execution finish time of node $n_i$ on processor $p_j$, respectively. They are defined by

$$EST(n_i, p_j) = \max\Big\{ T\_Available[j],\ \max_{n_m \in pred(n_i)} \big( EFT(n_m, p_k) + c_{m,i} \big) \Big\} \tag{1}$$

$$EFT(n_i, p_j) = w_{i,j} + EST(n_i, p_j) \tag{2}$$

where $pred(n_i)$ is the set of immediate predecessors of task $n_i$, and $T\_Available[j]$ is the earliest time at which processor $p_j$ is available for task execution. The inner max block in the $EST$ equation returns the ready time, i.e., the time when all data needed by $n_i$ has arrived at the host $p_j$. The assignment decisions are stored in a two-dimensional matrix, $list$. The $j$th row of the $list$ matrix holds the sequence of nodes (in order of execution start time) already scheduled on $p_j$.
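The start and finish time definitions can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the names `finish`, `proc`, and `comm` are hypothetical, `comm[(m, i)]` is taken to be the cost of edge (m, i) when the two tasks sit on different processors, and a task with no predecessors is given a ready time of zero.

```python
def est(i, j, t_available, pred, finish, proc, comm):
    """Earliest start time of task i on processor j.

    finish[m] and proc[m] are the finish time and processor of each
    already-scheduled predecessor m (hypothetical bookkeeping names).
    """
    ready = max(
        (finish[m] + (0.0 if proc[m] == j else comm[(m, i)]) for m in pred[i]),
        default=0.0,   # assumption: an entry task is ready at time zero
    )
    return max(t_available[j], ready)

def eft(i, j, W, t_available, pred, finish, proc, comm):
    """Earliest finish time: execution time on p_j added to the start time."""
    return W[i][j] + est(i, j, t_available, pred, finish, proc, comm)

# Illustrative two-task example: task 0 already ran on processor 0 until t = 14.
W = [[14.0, 16.0], [13.0, 19.0]]
pred = {0: [], 1: [0]}
comm = {(0, 1): 12.0}
proc = {0: 0}
finish = {0: 14.0}
t_available = [14.0, 0.0]
```

In this example task 1 finishes earlier on processor 0, even though that processor is busy until time 14, because placing it there avoids the transfer along edge (0, 1).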

The objective function of the scheduling problem is to determine an assignment of tasks of a given application to processors such that the schedule length (or makespan) is minimized while satisfying all precedence constraints. After all nodes in the DAG are scheduled, the schedule length will be the earliest finish time of the exit node $n_e$, $EFT(n_e, p_j)$, where the exit node is scheduled on processor $p_j$. (If a graph has multiple exit nodes, they are connected with zero-weight edges to a pseudo exit node that has zero computation cost. Similarly, a pseudo start node is added to graphs with multiple start nodes.)

The critical path (CP) of a DAG is the longest path from the entry node to the exit node in the graph. The length of this path, $|CP|$, is the sum of the computation costs of the nodes and inter-node communication costs along the path. The $|CP|$ value of a DAG is the lower bound of the schedule length.

In our algorithms we rank tasks as upward and downward to set the scheduling priorities. The upward rank of a task $n_i$ is recursively defined by

$$rank_u(n_i) = \overline{w_i} + \max_{n_j \in succ(n_i)} \big( \overline{c_{i,j}} + rank_u(n_j) \big) \tag{3}$$

where $succ(n_i)$ is the set of immediate successors of task $n_i$. Since it is computed recursively by traversing the task graph upward, starting from the exit node, it is referred to as an upward rank. Basically, $rank_u(n_i)$ is the length of the critical path (i.e., the longest path) from $n_i$ to the exit node, including the computation cost of the node itself. In some previous algorithms the ranks of the nodes are computed using computation costs only, which is referred to as static upward rank, $rank_u^s$.
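A minimal sketch of the upward rank recursion, assuming average execution costs `w_bar` and average edge costs `c_bar` have already been computed. The four-node graph and all numbers are made up for illustration; memoization with `lru_cache` is an implementation choice, not something the paper prescribes.

```python
from functools import lru_cache

# Illustrative DAG (not from the paper): 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
succ = {0: [1, 2], 1: [3], 2: [3], 3: []}
w_bar = {0: 15.0, 1: 16.0, 2: 12.0, 3: 10.0}                   # average execution costs
c_bar = {(0, 1): 12.0, (0, 2): 6.0, (1, 3): 4.0, (2, 3): 9.0}  # average edge costs

@lru_cache(maxsize=None)
def rank_u(i):
    """Upward rank of task i: longest average-cost path from i to the exit node."""
    # For the exit node the max over an empty successor set contributes 0,
    # so its rank is just its own average execution cost.
    return w_bar[i] + max((c_bar[(i, j)] + rank_u(j) for j in succ[i]), default=0.0)
```

With these numbers the entry node's upward rank is 57, which is the critical path length of the example graph measured in average costs.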

Similarly, the downward rank of a task $n_i$ is recursively defined by

$$rank_d(n_i) = \max_{n_j \in pred(n_i)} \big\{ rank_d(n_j) + \overline{w_j} + \overline{c_{j,i}} \big\} \tag{4}$$

The downward ranks are computed recursively by traversing the task graph downward. Basically, $rank_d(n_i)$ is the longest distance from the start node to the node $n_i$, excluding the computation cost of the node itself.
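The downward rank can be sketched the same way. The code assumes that each predecessor contributes its own downward rank, its average execution cost, and the average cost of the connecting edge, which matches the statement that the node's own computation cost is excluded. The graph and costs are illustrative only.

```python
from functools import lru_cache

# Illustrative DAG (not from the paper): 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
pred = {0: [], 1: [0], 2: [0], 3: [1, 2]}
w_bar = {0: 15.0, 1: 16.0, 2: 12.0, 3: 10.0}                   # average execution costs
c_bar = {(0, 1): 12.0, (0, 2): 6.0, (1, 3): 4.0, (2, 3): 9.0}  # average edge costs

@lru_cache(maxsize=None)
def rank_d(i):
    """Downward rank of task i: longest average-cost path from the start
    node to i, not counting the execution cost of i itself."""
    return max(
        (rank_d(j) + w_bar[j] + c_bar[(j, i)] for j in pred[i]),
        default=0.0,   # the start node has no predecessors
    )
```

For the exit node of this example, adding its own average cost to its downward rank gives 47 + 10 = 57, the length of the longest start-to-exit path.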

In some previous algorithms the level attribute is used to set the priorities of the tasks. The level of a task is computed as the maximum number of edges along any path to the task from the start node. The start node has a level of zero.
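As a short illustration of the level attribute (a sketch on a made-up graph, not code from any of the cited algorithms), the level is the longest edge count from the start node:

```python
from functools import lru_cache

# Illustrative DAG: 0 -> 1, 0 -> 2, 0 -> 3, 1 -> 3, 2 -> 3.
pred = {0: [], 1: [0], 2: [0], 3: [0, 1, 2]}

@lru_cache(maxsize=None)
def level(i):
    """Maximum number of edges on any path from the start node to task i."""
    return max((level(j) + 1 for j in pred[i]), default=0)
```

Node 3 is reachable directly from node 0, but its level is 2 because the longest path to it has two edges.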

3. Proposed Algorithms

In this section, we present our scheduling algorithms, the Heterogeneous Earliest-Finish-Time (HEFT) Algorithm and the Critical-Path-on-a-Processor (CPOP) Algorithm.

3.1.  The  HEFT  Algorithm 

The Heterogeneous-Earliest-Finish-Time (HEFT) Algorithm, as shown in Figure 1, is a DAG scheduling algorithm that supports a bounded number of heterogeneous processing elements (PEs). To set the priority of a task $n_i$, the HEFT algorithm uses the upward rank value of the task, $rank_u$ (Equation 3), which is the length of the longest path from $n_i$ to the exit node. The $rank_u$ calculation is based on mean computation and communication costs. The task list is generated by sorting the nodes in decreasing order of their $rank_u$ values. In our implementation ties are broken randomly; i.e., if two nodes to be scheduled have equal $rank_u$ values, one of them is selected randomly.

The HEFT algorithm uses the earliest finish time
value, EFT, to select the processor for each task. In
noninsertion-based scheduling algorithms, the earliest
available time of a processor p_j, the T_available[j]
term in Equation 1, is the execution completion time
of the last assigned node on p_j. The HEFT Algorithm,
which is insertion-based, considers a possible insertion
of each task in an earliest idle time slot between two
already-scheduled tasks on the given processor. Formally,
node n_i can be scheduled on processor p_j in the slot
given by the minimum value of k that satisfies the
inequality

EST(list_{j,k+1}, p_j) - EFT(list_{j,k}, p_j) \ge w_{i,j}    (5)

where list_{j,k} is the kth node (in the start time
sequence) that was already assigned to the processor
p_j. Then, T_available[j] will be equal to
EFT(list_{j,k}, p_j). The time complexity of the HEFT
Algorithm is O(v^2 x p) for v nodes and p processors.
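
The insertion-based processor selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names find_slot and heft, the data layout, and the rule that communication cost is zero when two tasks share a processor are assumptions made here. Ties in rank_u and in finish time are resolved by input order in this sketch, whereas the implementation described above breaks rank ties randomly.

```python
def find_slot(busy, ready, duration):
    """Earliest start time >= ready at which `duration` fits on a processor.

    busy is the processor's list of (start, finish) pairs, sorted by start.
    """
    start = ready
    for slot_start, slot_finish in busy:
        if slot_start - start >= duration:   # idle gap is long enough
            return start
        start = max(start, slot_finish)
    return start                             # append after the last task


def heft(succ, w, c, rank_u):
    """Sketch of HEFT. w[i][j] is the cost of task i on processor j and
    c[(a, b)] the communication cost of edge a -> b, assumed to be zero
    when both tasks run on the same processor."""
    pred = {n: [] for n in succ}
    for a, children in succ.items():
        for b in children:
            pred[b].append(a)
    n_proc = len(next(iter(w.values())))
    busy = [[] for _ in range(n_proc)]
    placed = {}                              # task -> (processor, start, finish)
    # Decreasing rank_u is a valid topological order when all costs are positive.
    for i in sorted(succ, key=lambda n: -rank_u[n]):
        best = None
        for j in range(n_proc):
            ready = max((placed[a][2] + (0 if placed[a][0] == j else c[(a, i)])
                         for a in pred[i]), default=0)
            start = find_slot(busy[j], ready, w[i][j])
            finish = start + w[i][j]
            if best is None or finish < best[2]:
                best = (j, start, finish)
        placed[i] = best
        busy[best[0]].append(best[1:])
        busy[best[0]].sort()
    return placed
```

Keeping each processor's busy list sorted by start time is what lets find_slot scan the gaps in order and return the first one that is long enough.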

3.2.  The  CPOP  Algorithm 

The Critical-Path-on-a-Processor (CPOP) Algorithm,
shown in Figure 2, is another heuristic for
scheduling tasks on a bounded number of heterogeneous
processors. The rank_u and rank_d attributes of
nodes are computed using mean computation and
communication costs. The critical path nodes (CPNs) are
determined at Steps 5-6. The critical-path-processor
(CPP) is the one that minimizes the length of the
critical path (Step 7). The CPOP Algorithm uses
rank_d(n_i) + rank_u(n_i) to assign the node priority. The
processor selection phase has two options: if the current
node is on the critical path, it is assigned to the
critical path processor (CPP); otherwise, it is assigned
to the processor that minimizes the execution completion
time. The latter option is insertion-based (as in
the HEFT algorithm). At each iteration we maintain
a priority queue that contains all free nodes and select the
node that maximizes rank_d(n_i) + rank_u(n_i). A binary
heap was used to implement the priority queue, which
has time complexity of O(log v) for insertion and deletion
of a node and O(1) for retrieving the node with the
highest priority (the root node of the heap). The time
complexity of the algorithm is O(v^2 x p) for v nodes
and p processors.
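
The parts of CPOP that differ from HEFT can be sketched as below: marking the critical-path nodes, choosing the critical-path processor, and draining a binary-heap priority queue of ready nodes. The function names and data layout are assumptions made for illustration. Python's heapq is a min-heap, so priorities are negated; note that if several paths tie for the critical length, this sketch marks the nodes of all of them.

```python
import heapq


def critical_path_nodes(rank_u, rank_d, start, tol=1e-9):
    """Nodes whose rank_d + rank_u equals |CP| = rank_u of the start node."""
    cp_len = rank_u[start]
    return {n for n in rank_u if abs(rank_u[n] + rank_d[n] - cp_len) <= tol}


def critical_path_processor(cpn, w):
    """Processor index minimizing the summed cost of the critical-path nodes."""
    n_proc = len(next(iter(w.values())))
    return min(range(n_proc), key=lambda j: sum(w[i][j] for i in cpn))


def priority_order(succ, rank_u, rank_d):
    """Order in which ready nodes leave the queue (largest rank_d + rank_u first)."""
    indeg = {n: 0 for n in succ}
    for children in succ.values():
        for j in children:
            indeg[j] += 1
    heap = [(-(rank_u[n] + rank_d[n]), n) for n in succ if indeg[n] == 0]
    heapq.heapify(heap)
    order = []
    while heap:
        _, n = heapq.heappop(heap)
        order.append(n)
        for j in succ[n]:
            indeg[j] -= 1
            if indeg[j] == 0:            # all predecessors done: node is ready
                heapq.heappush(heap, (-(rank_u[j] + rank_d[j]), j))
    return order


# Made-up example: node c is off the critical path a -> b -> d.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
rank_u = {"a": 8, "b": 5, "c": 3, "d": 1}
rank_d = {"a": 0, "b": 3, "c": 3, "d": 7}
w = {"a": [2, 2], "b": [4, 2], "c": [1, 1], "d": [1, 1]}
cpn = critical_path_nodes(rank_u, rank_d, "a")
cpp = critical_path_processor(cpn, w)
order = priority_order(succ, rank_u, rank_d)
```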

4.  Related  Work 

Only a few of the proposed task scheduling heuristics
support variable computation and communication
costs for heterogeneous domains: the Dynamic
Level Scheduling (DLS) Algorithm [4], the Levelized-Min
Time (LMT) Algorithm [9], and the Mapping
Heuristic (MH) Algorithm [5]. Although there are
genetic algorithm based research efforts [6, 10, 14], most
of them are slow and usually do not perform as well as
the list-scheduling algorithms.

1. Compute rank_u for all nodes by traversing graph upward, starting from the exit node.
3. Sort the nodes in a list by nonincreasing order of rank_u values.
4. while there are unscheduled nodes in the list do
5. begin
6.   Select the first task n_i in the list and remove it.
7.   Assign the task n_i to the processor p_j that minimizes the (EFT) value of n_i.
8. end

Figure 1. The HEFT Algorithm


1. Compute rank_u for all nodes by traversing graph upward, starting from the exit node.
2. Compute rank_d for all nodes by traversing graph downward, starting from the start node.
3. |CP| = rank_u(n_s), where n_s is the start node.
4. For each node n_i do
5.   If (rank_d(n_i) + rank_u(n_i) = |CP|) then
6.     n_i is a critical path node (CPN).
7. Select the critical-path-processor that minimizes \sum_{n_i \in CPN} w_{i,j}.
8. Initialize the priority-queue with the entry nodes.
9. while there is an unscheduled node in the priority-queue do
10. begin
11.   Select the highest priority node from priority-queue,
12.   which maximizes rank_d(n_i) + rank_u(n_i).
13.   if (n_i is a CPN) then
14.     schedule n_i to critical-path-processor.
15.   else
16.     Assign the task n_i to the processor p_j which minimizes the (EFT) value of n_i.
17.   Update the priority-queue with the successor(s) of n_i if they become ready-nodes.
18. end

Figure 2. The CPOP Algorithm


Mapping Heuristic (MH) Algorithm. The MH
Algorithm uses static upward ranks (rank_u^s) to assign
priorities to the nodes. A ready node list is kept sorted
in decreasing order of priority. (For tie-breaking, the
node with the largest number of immediate successors
is selected.) With a noninsertion-based method, the
processor that provides the minimum earliest finish
time of a task is selected to run the task. After a task
is scheduled, the immediate successors of the task are
inserted into the list. These steps are repeated until
all nodes are scheduled. The time complexity of this
algorithm is O(v^2 x p) for v nodes and p processors.

Dynamic-Level Scheduling (DLS) Algorithm.
The DLS Algorithm assigns node priorities by using
an attribute called Dynamic Level (DL) that is equal
to DL(n_i, p_j) = rank_u^s(n_i) - EST(n_i, p_j). (In contrast
to the mean values, median values are used to compute
the static upward ranks; and for the EST computation,
the noninsertion method is used.) At each scheduling
step the algorithm selects the (ready node, available
processor) pair that maximizes the dynamic level value.
For heterogeneous environments the \Delta(n_i, p_j) term is
added to the dynamic level computation. The \Delta value
for a task on a processor is the difference between the
task's median execution time on all processors and its
execution time on the current processor.
The DLS Algorithm has an O(v^3 x p x f(p)) time
complexity, where v is the number of tasks and p is the
number of processors. The complexity of the function
used for data routing to calculate the earliest start time
is O(f(p)) [4].
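
The dynamic level with its heterogeneity term can be illustrated with a short sketch. The names dynamic_level and select_pair, the dictionary layout, and the example numbers are assumptions chosen here; the static ranks and EST values are taken as given inputs rather than computed.

```python
from statistics import median


def dynamic_level(static_rank, est, costs, j):
    """DL of one (node, processor) pair, including the heterogeneity term.

    costs is the node's execution time on every processor; j is the
    index of the processor under consideration.
    """
    delta = median(costs) - costs[j]     # positive on faster-than-median processors
    return static_rank - est + delta


def select_pair(ready, n_proc, static_rank, est, w):
    """Return the (node, processor) pair with the largest dynamic level."""
    pairs = ((i, j) for i in ready for j in range(n_proc))
    return max(pairs,
               key=lambda ij: dynamic_level(static_rank[ij[0]], est[ij],
                                            w[ij[0]], ij[1]))


# Made-up example with two ready nodes and three processors.
w = {"x": [4, 2, 6], "y": [3, 3, 3]}
static_rank = {"x": 10, "y": 12}
est = {("x", 0): 0, ("x", 1): 1, ("x", 2): 0,
       ("y", 0): 2, ("y", 1): 2, ("y", 2): 2}
choice = select_pair(["x", "y"], 3, static_rank, est, w)
```

Here node x on processor 1 wins even though it starts later there, because its execution time on that processor is well below its median.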

Levelized-Min Time (LMT) Algorithm. This is a two-phase
algorithm. The first phase orders the tasks based
on their precedence constraints, i.e., level by level. This
phase groups the tasks that can be executed in parallel.
The second phase is a greedy method that assigns each
task (level by level) to the "fastest" available processor
as much as possible. A task in a lower level has higher
priority for scheduling than a task in a higher level;
within the same level, the task with the highest average
computation cost has the highest priority. If the
number of tasks in a level is greater than the number
of available processors, the fine-grain tasks are merged
into a coarse-grain task until the number of tasks is
equal to the number of processors. Then the tasks are
sorted in reverse order (largest task first) based on
average computation time. Beginning from the largest
task, each task will be assigned to the processor: a)
that minimizes the sum of the computation cost of the task
and the communication costs with tasks in the previous
layers; and b) that does not have any scheduled
task at the same level. For a fully-connected graph,
the time complexity is O(v^2 x p^2) when there are v tasks
and p processors.


5. Performance and Comparison

In this section we present the performance comparison
of our algorithms with the related work, using
randomly generated task graphs and regular task
graphs representing applications. The following
metrics were used to compare the performance of our
proposed algorithms with the previous approaches.

• Schedule Length Ratio (SLR). The SLR of an
algorithm is defined as

SLR = \frac{makespan}{\sum_{n_i \in CP_{MIN}} \min_{p_j \in P} \{ w_{i,j} \}}

where makespan is the schedule length of the
algorithm's output schedule. The denominator
is the summation of computation costs of nodes
on the CP_MIN. (For an unscheduled DAG, if
the computation cost of each node n_i is set to
\min_{p_j \in P} \{ w_{i,j} \}, then the resulting critical path,
CP_MIN, will be based on minimum computation
values.) The SLR value of any algorithm for a
DAG cannot be less than one, since the denominator
in the equation is the lower bound for the completion
time of the graph. The algorithm that gives
the lowest SLR value for a DAG is the best algorithm
with respect to performance (i.e., the one
that gives the minimum overall execution time).

• Speedup. The speedup value of an algorithm
is computed by dividing the overall computation
time of the sequential execution by the makespan
of the algorithm. (The sequential execution time
is computed by assigning all tasks to a single
processor that minimizes the total computation time
of the DAG.) Formally, it is defined as

Speedup = \frac{\min_{p_j \in P} \{ \sum_{i=1}^{v} w_{i,j} \}}{makespan}

• Running Time. This is the average cost of each
scheduling algorithm for obtaining the schedules
of the given graphs on a Sun SPARC 10 workstation.
The trade-offs between the performance
(SLR value) and cost (running time) of scheduling
algorithms are given in this section.

5.1. Generating Random Task Graphs

We developed an algorithm to generate random
directed acyclic graphs that are used to evaluate the
proposed scheduling algorithms. The input parameters
required to generate a weighted random DAG are the
following:

• Number of nodes (tasks) in the graph, v.

• Shape parameter of the graph, α. We assume that
the height of a DAG is randomly generated from a
uniform distribution with mean equal to \sqrt{v} / α.
The width for each level in a DAG is randomly
selected from a uniform distribution with mean
equal to α x \sqrt{v}. If α = 1.0, then it will be a balanced
DAG. A dense DAG (a shorter graph with high
parallelism) can be generated by selecting α >> 1.0.
Similarly, if α << 1.0, it will generate a longer
DAG with a small parallelism degree.

• Out degree of a node, out_degree. One method is
to use a fixed out_degree value for all nodes in a
DAG as much as possible. Another alternative is
to use a mean out_degree. Then, each node's out
degree will be randomly generated from a uniform
distribution with mean equal to out_degree.

• Communication to computation ratio, CCR.
CCR is the ratio of the average communication
cost to the average computation cost. If a DAG's
CCR value is low, it can be considered as a
computation-intensive application; if it is high, it
can be considered as a communication-intensive
application.

• Average computation cost in the graph, avg_comp.
Computation costs are generated randomly from
a uniform distribution with mean equal to
avg_comp. Similarly, the average communication
cost is derived as avg_comm = CCR x avg_comp.

• Range percentage of computation costs on processors,
β. A high β value causes a significant difference
in a node's computation cost among the
processors; a very low β value means that the expected
execution times of a node on any given processor
are almost equal. If the average computation
cost of a node n_i is \overline{w_i}, then the computation
cost of n_i on any processor p_j will be randomly
selected from the range

\overline{w_i} x (1 - β/2) \le w_{i,j} \le \overline{w_i} x (1 + β/2),

where β is the range percentage.

• Number of available processors in the system,
num_pe.
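
Two of these generator parameters can be illustrated with a short sketch. It is hedged in two ways: the function names are hypothetical, and the mean height and width are assumed to be sqrt(v)/α and α x sqrt(v), which agrees with the later remark that α = 2.0 yields an average width of about four times the average depth. The cost range follows the β definition, i.e., a node's cost on each processor lies within β/2 of its average cost in either direction.

```python
import math
import random


def shape_means(v, alpha):
    """Assumed mean height and mean level width of a random DAG with v nodes."""
    return math.sqrt(v) / alpha, alpha * math.sqrt(v)


def processor_costs(avg_cost, beta, n_proc, rng):
    """Costs of one node on each processor, drawn uniformly from the
    range avg_cost * (1 - beta/2) .. avg_cost * (1 + beta/2)."""
    lo = avg_cost * (1 - beta / 2)
    hi = avg_cost * (1 + beta / 2)
    return [rng.uniform(lo, hi) for _ in range(n_proc)]


height, width = shape_means(100, 2.0)
costs = processor_costs(10.0, 0.5, 4, random.Random(0))
```

With β = 0.5 and an average cost of 10, every generated cost falls between 7.5 and 12.5; a β close to zero would make the processors nearly identical for that node.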

To compare the scheduling algorithms presented, we
used random graphs as a test suite. The input parameters
of the directed acyclic graph generation algorithm
were set to the following values:

• v = {20, 40, 60, 80, 100}

• CCR = {0.1, 0.5, 1.0, 5.0, 10.0}

• α = {0.5, 1.0, 2.0}

• out_degree = {1, 2, 3, 4, 5, 100}

• β = {0.1, 0.25, 0.5, 0.75, 1.0}

These combinations give 2250 different DAG types.
Since we generated 25 random DAGs for each DAG
type, the total number of DAGs used in our comparison
study is 56250. The following part gives
the algorithm rankings for several graph parameters.
Each ranking starts with the best algorithm and ends
with the worst one, with respect to a given parameter.
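
The size of this test suite can be checked by enumerating the parameter grid. In the sketch below the dictionary keys are illustrative names, and the six-valued set {1, 2, 3, 4, 5, 100} is taken to be the out degree, which is the only generator parameter with no other value set listed.

```python
from itertools import product

grid = {
    "v": [20, 40, 60, 80, 100],
    "CCR": [0.1, 0.5, 1.0, 5.0, 10.0],
    "alpha": [0.5, 1.0, 2.0],
    "out_degree": [1, 2, 3, 4, 5, 100],
    "beta": [0.1, 0.25, 0.5, 0.75, 1.0],
}

# One dictionary per DAG type, i.e., per combination of parameter values.
dag_types = [dict(zip(grid, combo)) for combo in product(*grid.values())]
total_dags = 25 * len(dag_types)                  # 25 random DAGs per type
dags_per_size = total_dags // len(grid["v"])      # graphs behind each graph size
```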

Performance Study with Respect to Graph Size

The performances (SLR values) of the algorithms were
compared with respect to different graph sizes in
Figure 3(a). The SLR value for each graph size is the
average SLR value of 11250 different graphs that were
generated randomly with different CCR, α, β, and
out_degree values when the number of available
processors was equal to 4. The performance ranking of
the algorithms is {HEFT, CPOP, DLS, MH, LMT}.
The average SLR value of HEFT on all generated
graphs is better than that of the CPOP Algorithm by 7%,
the DLS Algorithm by 8%, the MH Algorithm by 16%,
and the LMT Algorithm by 52%. The average speedup
curve with respect to the number of nodes is given in
Figure 3(b). The average speedup ranking of the
algorithms is {HEFT, DLS, (CPOP=MH), LMT}. We
repeated these tests for the case when the number of
processors was equal to 10. In this case, although the
average SLR values were lower than in the previous case,
they gave the same performance ranking of the algorithms.
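
For reference, the two quality metrics used in this comparison can be sketched as follows. The function names and the data layout (w[i][j] as the cost of task i on processor j) are assumptions made for illustration, and the nodes on the minimum-cost critical path are passed in rather than derived.

```python
def slr(makespan, cp_min_nodes, w):
    """Schedule length ratio: makespan over the sum of the minimum
    computation costs of the nodes on the minimum-cost critical path."""
    return makespan / sum(min(w[i]) for i in cp_min_nodes)


def speedup(makespan, w):
    """Best single-processor (sequential) execution time over the makespan."""
    n_proc = len(next(iter(w.values())))
    sequential = min(sum(row[j] for row in w.values()) for j in range(n_proc))
    return sequential / makespan


# Made-up example: four tasks, two processors, a schedule of length 10.
w = {"a": [2, 3], "b": [3, 2], "c": [4, 4], "d": [1, 2]}
example_slr = slr(10, ["a", "c", "d"], w)
example_speedup = speedup(10, w)
```

An SLR below one would indicate an error in the schedule or in the critical path, since the denominator is a lower bound on the completion time.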


Figure 3. (a) Average SLR (b) Average
Speedup with Respect to Graph Size

Figure 4 shows the average running time of each
algorithm. From this figure it can be concluded that the
HEFT algorithm is the fastest and the DLS algorithm
is the slowest among the given algorithms. When the
average running time of the algorithms on all graphs is
computed by combining the results of different graph sizes,
the HEFT Algorithm is faster than the CPOP Algorithm
by 10%, the MH Algorithm by 32%, the DLS
Algorithm by 84%, and the LMT Algorithm by 48%.

Figure 4. Average Running Time of the Algorithms
with Respect to Graph Size

Performance Study with Respect to Graph
Structure

When α (the shape parameter of a graph) is equal
to 0.5, the generated graphs have longer depths with
a low parallelism degree. If the average SLR values of
the algorithms are compared when α = 0.5, the
performance of the HEFT Algorithm is better than the
CPOP Algorithm by 8%, the MH Algorithm by 12%,
the DLS Algorithm by 6%, and the LMT Algorithm
by 40%. When α is equal to 1.0, the average depth of the
graph and the average width of each layer in the graph are
approximately equal. For this case, the average SLR
value of the HEFT Algorithm is better than the
CPOP Algorithm by 7%, the MH Algorithm by 14%,
the DLS Algorithm by 7%, and the LMT Algorithm by
34%. Similarly, if α = 2.0, the average width is
approximately equal to four times the average depth,
which results in short and dense graphs. For
this case, the HEFT Algorithm is better than the
CPOP Algorithm by 6%, the MH Algorithm by 15%,
the DLS Algorithm by 8%, and the LMT Algorithm
by 31%. For all three α values, the HEFT Algorithm
gives the best performance among the five algorithms.


Performance Study with Respect to CCR

Figure 5 gives the average SLR of the algorithms
when the CCR value is in [0.1, 1.0] and [1.0, 5.0] and the
number of processors is equal to 10. When CCR >
0.25, the HEFT Algorithm outperforms the other
algorithms. If CCR < 0.2, the DLS algorithm gives a
better performance than the HEFT Algorithm. The
performance ranking of the algorithms (when CCR <
1.0) is {HEFT, DLS, MH, CPOP, LMT}. When
CCR > 1.0 the performance ranking changes to
{HEFT, CPOP, DLS, MH, LMT}.

Figure 5. Average SLR of the Algorithms with
Respect to CCR: (a) 0.1 < CCR < 1.0 (b)
1.0 < CCR < 10.0

5.2. Applications

In this section we present a performance comparison
of the scheduling algorithms based on Gaussian
elimination and the Fast Fourier Transformation (FFT).
For the Gaussian elimination algorithm, we did the
analytical work to set approximately the computation
costs of the nodes and the communication costs of the
edges. The number of nodes in the Gaussian elimination
algorithm can be characterized by the input matrix size.
For FFT, the computation cost of each node is randomly
set from a normal distribution with mean equal to an
assigned average computation cost. The communication
costs are set randomly from a normal distribution with
mean equal to the product of CCR and the average
computation cost. The number of nodes in an FFT task
graph can be computed in terms of the number of input
points (i.e., the order of the FFT).

In order to provide different processor speeds, the
computing power weights [18] of the processors were
randomly set from a given range. The computing power
weight of each processor P_j, (\varphi_j), is a number that
shows the ratio of the CPU speed of P_j to the CPU speed
of the fastest processor in the system. In order to
set the computation cost of each task on the processors,
the computation cost of the node on a base processor
(P_base) is set (either randomly as in FFT or analytically
as in the Gaussian elimination). Then the computation
cost for each other processor P_k is computed
by W(T_j, P_k) = (\varphi_base / \varphi_k) x W(T_j, P_base).
For the fastest processor in the system, the computing
power weight is equal to 1.
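
The scaling from a base processor to the other processors can be sketched as below. The exact formula is an assumption in this sketch: the cost is taken to be inversely proportional to the computing power weight, which is what a weight defined as relative CPU speed implies. The function name and argument layout are hypothetical.

```python
import math


def cost_on_processor(base_cost, weights, base, k):
    """Cost of a task on processor k, given its cost on processor `base`.

    weights[j] is the CPU speed of processor j relative to the fastest
    processor, so a smaller weight means a proportionally larger cost.
    """
    return weights[base] / weights[k] * base_cost


# Made-up example: processor 0 is the fastest and serves as the base.
weights = [1.0, 0.5, 0.2]
costs = [cost_on_processor(10.0, weights, 0, k) for k in range(3)]
```

With a weight of 0.2, the slowest processor takes five times as long as the fastest one for the same task.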

In the experiments the range of the computing
power weights can be [0.8, 1.0], [0.6, 1.0], [0.4, 1.0] or
[0.2, 1.0]. The first range is for a domain in which the
computational powers of the processors are almost equal;
with the last range ([0.2, 1.0]), however, the fastest processor
can be up to five times faster than the slowest one. We
also varied the task graph granularities by varying the
CCR values over the set {0.1, 0.5, 1.0, 2.0, 5.0, 10.0}.

Gaussian elimination. Figure 6(a) gives the
sequential program for the Gaussian elimination
algorithm [1, 17]. The data-flow graph of the algorithm for
the special case of n = 5, where n is the dimension
of the matrix, is given in Figure 6(b). The number of
nodes in the task graph of this algorithm is roughly
n^2 / 2. Each T_{k,k} represents a pivot column operation,
and each T_{k,j} represents an update operation.
The node computation cost function for any task T_{k,j}
is equal to w_{k,j} = 2 x (n - k) x w, where 1 \le k \le j \le n
(w is the average execution time of either an addition
and a multiplication, or of a division).

Figure 6. (a) The Gaussian elimination Algorithm
(kji version) (b) The Task graph for a
Matrix of Size 5

The communication cost function for any
edge is c_{k,j} = (n - k) x h, where h is the
transmission rate. Each Gaussian elimination task graph
has one critical path (CP), which has the maximum
number of nodes covered as compared to the
other paths. In Figure 6(b), the critical path is
T_{1,1} T_{1,2} T_{2,2} T_{2,3} T_{3,3} T_{3,4} T_{4,4} T_{4,5}.
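
The cost functions and the critical path of the Gaussian elimination graph can be reproduced with a short sketch. The function names are hypothetical, and the index range used here (pivot steps k = 1 to n - 1, with j running from k to n) is an assumption inferred from the critical path listed for n = 5, which ends at T_{4,5}.

```python
def gaussian_tasks(n, w=1.0):
    """Computation cost 2 * (n - k) * w of every task T[k, j]."""
    return {(k, j): 2 * (n - k) * w
            for k in range(1, n) for j in range(k, n + 1)}


def comm_cost(n, k, h=1.0):
    """Communication cost (n - k) * h of an edge leaving a step-k task."""
    return (n - k) * h


def critical_path(n):
    """Pivot task T[k, k] followed by the update T[k, k+1], for each step k."""
    path = []
    for k in range(1, n):
        path += [(k, k), (k, k + 1)]
    return path


tasks = gaussian_tasks(5)
path = critical_path(5)
```

Under this enumeration a matrix of size 5 yields 14 tasks, and the costs shrink with each elimination step because fewer rows remain to be updated.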

Figure 7(a) gives the average SLR values of the
algorithms at various matrix sizes. The performances of
the HEFT and DLS algorithms were the best of all.
The performance ranking of the algorithms for Gaussian
elimination graphs is {(HEFT = DLS), CPOP,
MH, LMT}. As in the results of other applications,
the SLR values of all scheduling algorithms slightly
increase if the matrix size is increased. Increasing the
matrix size causes more nodes not to be on the critical
path, which causes increases in the makespan.

Figure 7(b) gives the efficiency comparison of the
algorithms for the Gaussian elimination graphs of about
1250 nodes (i.e., the matrix size is 50) at various
numbers of processors. Efficiency is the ratio of the speedup
value to the number of processors used. The HEFT
and DLS algorithms have greater efficiency than the
other algorithms; when the number of processors is
increased beyond eight, the HEFT algorithm outperforms
the DLS algorithm in terms of efficiency. For all
scheduling algorithms, increasing the number of
processors reduces efficiency because the matrix size (thus
the number of nodes) is fixed. Additionally, an increase
in the number of processors causes fewer nodes than
the number of processors at some levels of the graph,
which will decrease the utilization of the processors.

Figure 8 gives the average running time of each
algorithm for the Gaussian elimination graph for various
numbers of processors. Although the DLS algorithm
gives as good performance as the HEFT algorithm
for Gaussian elimination graphs, it is the slowest
algorithm. (For Gaussian elimination task graphs
when the matrix size is 50 with 16 processors, the DLS
algorithm takes 16.2 times longer than the HEFT
algorithm to schedule tasks.) The running time of the
HEFT algorithm is comparable to the MH algorithm,
but it is greater than the LMT algorithm. The cost
ranking of the algorithms (starting from the lowest)
for the Gaussian elimination graphs is {LMT, (HEFT,
MH), CPOP, DLS}. Although the LMT Algorithm is a
higher-complexity algorithm than the others except for
the DLS algorithm (see Figure 4), it gives the best
running time for Gaussian elimination graphs due to the
fact that half of the levels of a Gaussian elimination
graph have a single node, which decreases the cost of
the LMT algorithm.


Figure 7. (a) Average SLRs of Scheduling Algorithms
at Various Graph Sizes for Gaussian
Elimination Graph (b) The Efficiency Comparison
of the Algorithms

Figure 8. Average Running Time for Each
Algorithm to Schedule Gaussian Elimination
Graphs


Fast Fourier Transformation. The recursive,
one-dimensional FFT Algorithm [15, 16] and its task graph
are given in Figure 9. A is an array of size n, which holds
the coefficients of the polynomial; array Y is the output
of the algorithm. The algorithm consists of two parts:
recursive calls (lines 3-4) and the butterfly operation
(lines 6-7). The task graph in Figure 9(b) can be
divided into two parts: the nodes above the dashed line
are the recursive call nodes (RCNs), and the ones
below the line are butterfly operation nodes (BONs). For
an input vector of size n, there are 2 x n - 1 RCNs and
n x log_2 n BONs. (We assume that n = 2^m for some
integer m.) Each path from the start node to any of
the exit nodes in an FFT task graph is a critical path
because the computation costs of nodes in any level are
equal, and the communication costs of edges between
two consecutive levels are equal.
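
The node counts of an FFT task graph follow directly from these two expressions. A small sketch, with a hypothetical function name and a guard for the power-of-two assumption:

```python
def fft_node_counts(n):
    """Return (recursive call nodes, butterfly operation nodes) for n input points."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    m = n.bit_length() - 1           # n = 2**m
    return 2 * n - 1, n * m


rcn_4, bon_4 = fft_node_counts(4)
rcn_64, bon_64 = fft_node_counts(64)
```

For four input points this gives 7 recursive call nodes and 8 butterfly nodes, i.e., a task graph of 15 nodes.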




Figure  9.  (a)  FFT  Algorithm  (b)  The  Generated 
DAG  of  FFT  with  Four  Points. 

Figure 10(a) shows the average SLR values of the
scheduling algorithms for FFT graphs for various
numbers of input points when the number of processors
is equal to six. The HEFT algorithm outperforms
the other algorithms. The performance ranking of
the algorithms is {HEFT, DLS, CPOP, MH, LMT}.
Figure 10(b) shows the efficiency values obtained for
each of the algorithms when there are 64 data points.
The HEFT and DLS algorithms result in good schedules
for all cases. The running time comparisons
of the algorithms, with respect to the number of input
points and with respect to the number of processors, are
shown in Figure 11. For FFT graphs, the DLS and
LMT Algorithms are high-cost scheduling algorithms,
whereas the HEFT, CPOP, and MH Algorithms are
low-complexity scheduling algorithms. Based on these
results it can be concluded that the HEFT algorithm
is the best algorithm in terms of performance and cost.

Figure 10. (a) Average SLRs of Scheduling
Algorithms at Various Graph Sizes for FFT
Graph (b) Efficiency Comparison of the Algorithms

Figure 11. Running Times of the Scheduling
Algorithms for FFT Graphs

6. Conclusion and Future Work

In this paper we have proposed two task scheduling
algorithms (the HEFT Algorithm and the CPOP
Algorithm) for heterogeneous processors. For the
generated random task graphs and the task graphs of
selected applications, the HEFT algorithm outperforms
the other algorithms in all performance metrics,
i.e., average schedule length ratio (SLR), speedup, and
time-complexity. Similarly, the CPOP Algorithm is
better than, or at least comparable to, the existing
algorithms. Both algorithms perform more stably than
the others in terms of scheduling quality and running
time.

We are extending our algorithms to improve their
performance at specific CCR ranges and computing
power weight ranges. Additionally, we plan to add
low-cost, local-search techniques to improve the scheduling
quality of our algorithms. Future work will include
different techniques for ordering the ready tasks and
extensions for the processor selection phase.

Abstract

Heterogeneous computing (HC) environments are
well suited to meet the computational demands of large,
diverse groups of tasks (i.e., a meta-task). The problem
of mapping (defined as matching and scheduling)
these tasks onto the machines of an HC environment
has been shown, in general, to be NP-complete, requiring
the development of heuristic techniques. Selecting
the best heuristic to use in a given environment,
however, remains a difficult problem, because comparisons
are often clouded by different underlying assumptions
in the original studies of each heuristic. Therefore, a
collection of eleven heuristics from the literature has
been selected, implemented, and analyzed under one set
of common assumptions. The eleven heuristics examined
are Opportunistic Load Balancing, User-Directed
Assignment, Fast Greedy, Min-min, Max-min, Greedy,
Genetic Algorithm, Simulated Annealing, Genetic
Simulated Annealing, Tabu, and A*. This study provides
one even basis for comparison and insights into
circumstances where one technique will outperform another.
The evaluation procedure is specified, the heuristics are
defined, and then selected results are compared.


This research was supported in part by the DARPA/ITO
Quorum Program under NPS subcontract numbers
N62271-98-M-0217 and N62271-98-M-0448. Some of the
equipment used was donated by Intel.


1.  Introduction 

Mixed-machine heterogeneous computing (HC)
environments utilize a distributed suite of different
high-performance machines, interconnected with high-speed
links, to execute different computationally intensive
applications that have diverse computational
requirements [10, 18, 24]. The general problem of mapping
(i.e., matching and scheduling) tasks to machines has
been shown to be NP-complete [8, 15]. Heuristics
developed to perform this mapping function are often
difficult to compare because of different underlying
assumptions in the original studies of each heuristic [3].
Therefore, a collection of eleven heuristics from the
literature has been selected, implemented, and compared
by simulation studies under one set of common
assumptions.

To facilitate these comparisons, certain simplifying
assumptions were made. Let a meta-task be defined
as a collection of independent tasks with no data
dependencies (a given task, however, may have subtasks
and dependencies among the subtasks). For this case
study, it is assumed that static (i.e., off-line or
predictive) mapping of meta-tasks is being performed. (In
some systems, all tasks and subtasks in a meta-task,
as defined above, are referred to as just tasks.)

It is also assumed that each machine executes a single
task at a time, in the order in which the tasks arrived.
Because there are no dependencies among the
tasks, scheduling is simplified, and thus the resulting
solutions of the mapping heuristics focus more on finding
an efficient matching of tasks to machines. It is
also assumed that the size of the meta-task (number
of tasks to execute), t, and the number of machines in
the HC environment, m, are static and known a priori.

Section  2  defines  the  computational  environment  pa¬ 
rameters  that  were  varied  in  the  simulations.  Descrip¬ 
tions  of  the  eleven  mapping  heuristics  are  found  in  Sec¬ 
tion  3.  Section  4  examines  selected  results  from  the 
simulation  study.  A  list  of  implementation  parameters 
and  procedures  that  could  be  varied  for  each  heuristic 
is  presented  in  Section  5. 

This  research  was  supported  in  part  by  the 
DARPA/ITO  Quorum  Program  project  called  MSHN 
(Management  System  for  Heterogeneous  Networks) 
[13].  MSHN  is  a  collaborative  research  effort  among  the 
Naval  Postgraduate  School,  NOEMIX,  Purdue  Univer¬ 
sity,  and  the  University  of  Southern  California.  The 
technical  objective  of  the  MSHN  project  is  to  design, 
prototype,  and  refine  a  distributed  resource  manage¬ 
ment  system  that  leverages  the  heterogeneity  of  re¬ 
sources  and  tasks  to  deliver  requested  qualities  of  ser¬ 
vice.  The  heuristics  developed  in  this  paper  or  their 
derivatives  may  be  included  in  the  Scheduling  Advisor 
component  of  the  MSHN  prototype. 

2.  Simulation  Model 

The  eleven  static  mapping  heuristics  were  evaluated 
using  simulated  execution  times  for  an  HC  environ¬ 
ment.  Because  these  are  static  heuristics,  it  is  assumed 
that  an  accurate  estimate  of  the  expected  execution 
time for each task on each machine is known prior to
execution and contained within an ETC (expected time
to compute) matrix. One row of the ETC matrix
contains the estimated execution times for a given task
on each machine. Similarly, one column of the ETC
matrix consists of the estimated execution times of a
given machine for each task in the meta-task. Thus,
ETC(i, j) is the estimated execution time for task i on
machine j. (These times are assumed to include the
time  to  move  the  executables  and  data  associated  with 
each  task  to  the  particular  machine  when  necessary.) 
The  assumption  that  these  estimated  expected  execu¬ 
tion  times  are  known  is  commonly  made  when  studying 
mapping  heuristics  for  HC  systems  (e.g.,  [11,  16,  25]). 
(Approaches  for  doing  this  estimation  based  on  task 
profiling  and  analytical  benchmarking  are  discussed  in 
[14,24].) 

For the simulation studies, characteristics of the
ETC matrices were varied in an attempt to represent
a variety of possible HC environments. The ETC
matrices used were generated using the following method.
Initially, a $t \times 1$ baseline column vector, $B$, of floating
point values is created. Let $\phi_b$ be the upper-bound of
the range of possible values within the baseline vector.
The baseline column vector is generated by repeatedly
selecting a uniform random number, $x_b^i \in [1, \phi_b)$, and
letting $B(i) = x_b^i$ for $0 \le i < t$. Next, the rows of the
ETC matrix are constructed. Each element $ETC(i,j)$
in row $i$ of the ETC matrix is created by taking the
baseline value, $B(i)$, and multiplying it by a uniform
random number, $x_r^{i,j}$, which has an upper-bound of
$\phi_r$. This new random number, $x_r^{i,j} \in [1, \phi_r)$, is called
a row multiplier. One row requires $m$ different row
multipliers, $0 \le j < m$. Each row $i$ of the ETC
matrix can then be described as $ETC(i,j) = B(i) \times x_r^{i,j}$,
for $0 \le j < m$. (The baseline column itself does not
appear in the final ETC matrix.) This process is
repeated for each row until the $t \times m$ ETC matrix is
full. Therefore, any given value in the ETC matrix is
within the range $[1, \phi_b \times \phi_r)$.
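
The generation method above can be sketched in a few lines of Python. This is a minimal illustration, not the simulation software used in the study; the function name generate_etc and its parameter names are assumptions made here. Note that Python's random.uniform may return the upper endpoint in rare rounding cases, so the half-open ranges are only approximated.

```python
import random

def generate_etc(t, m, phi_b, phi_r, rng=None):
    """Build a t x m ETC matrix (one row per task) from a baseline
    value per task and m row multipliers per row."""
    rng = rng or random.Random()
    etc = []
    for _ in range(t):
        baseline = rng.uniform(1.0, phi_b)  # B(i), drawn from about [1, phi_b)
        # One row multiplier per machine, each drawn from about [1, phi_r).
        row = [baseline * rng.uniform(1.0, phi_r) for _ in range(m)]
        etc.append(row)  # the baseline itself is not stored
    return etc

# Example: low task heterogeneity, low machine heterogeneity.
sample = generate_etc(t=8, m=4, phi_b=100, phi_r=10, rng=random.Random(0))
```

Passing a seeded random.Random instance keeps a generated matrix reproducible across runs.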

To  evaluate  the  heuristics  for  different  mapping  sce¬ 
narios,  the  characteristics  of  the  ETC  matrix  were  var¬ 
ied  based  on  several  different  methods  from  [2].  The 
amount  of  variance  among  the  execution  times  of  tasks 
in  the  meta-task  for  a  given  machine  is  defined  as  task 
heterogeneity. Task heterogeneity was varied by changing
the upper-bound of the random numbers within the
baseline column vector. High task heterogeneity was
represented by $\phi_b = 3000$ and low task heterogeneity
used $\phi_b = 100$. Machine heterogeneity represents the
variation that is possible among the execution times for
a given task across all the machines. Machine
heterogeneity was varied by changing the upper-bound of the
random numbers used to multiply the baseline values.
High machine heterogeneity values were generated using
$\phi_r = 1000$, while low machine heterogeneity values
used $\phi_r = 10$. These heterogeneous ranges are based
on one type of expected environment for MSHN.

To  further  vary  the  ETC  matrix  in  an  attempt  to 
capture  more  aspects  of  realistic  mapping  situations, 
different ETC matrix consistencies were used. An ETC
matrix is said to be consistent if whenever a machine
j executes any task i faster than machine k, then
machine j executes all tasks faster than machine k [2].
Consistent matrices were generated by sorting each
row of the ETC matrix independently. In contrast,
inconsistent matrices characterize the situation where
machine j is faster than machine k for some tasks, and
slower for others. These matrices are left in the
unordered, random state in which they were generated. In
between these two extremes, semi-consistent matrices
represent a partial ordering among the machine/task
execution times. For the semi-consistent matrices used
here, the row elements in column positions {0, 2, 4, ...}
of row i are extracted, sorted, and replaced in order,
while the row elements in column positions {1, 3, 5, ...}
remain unordered. (That is, the even columns are
consistent and the odd columns are inconsistent.)
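
As a hedged illustration of the two transformations just described, the following Python sketch derives consistent and semi-consistent matrices from an existing (inconsistent) one. The function names are hypothetical, and ascending sort order is an assumption; the text does not state the sort direction.

```python
def make_consistent(etc):
    """Sort every row independently (ascending order assumed)."""
    return [sorted(row) for row in etc]

def make_semi_consistent(etc):
    """Sort only the entries in even column positions of each row;
    odd column positions keep their original values."""
    result = []
    for row in etc:
        new_row = list(row)
        new_row[0::2] = sorted(row[0::2])  # extract, sort, replace in order
        result.append(new_row)
    return result

example = [[5, 1, 3, 9, 2],
           [4, 8, 6, 7, 0]]
consistent = make_consistent(example)
semi = make_semi_consistent(example)
```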

Sample  ETC  matrices  for  the  four  inconsistent  het¬ 
erogeneous  permutations  of  the  characteristics  listed 
above  are  shown  in  Tables  1  through  4.  (Other  proba¬ 
bility  distributions  for  ETC  values,  including  an  expo¬ 
nential  distribution  and  a  truncated  Gaussian  [1]  dis¬ 
tribution,  were  also  investigated,  but  not  included  in 
the  results  discussed  here.)  All  results  in  this  study 
used  ETC  matrices  that  were  of  size  t  =  512  tasks  by 
m  =  16  machines.  While  it  was  necessary  to  select 
some  specific  parameter  values  to  allow  implementa¬ 
tion  of  a  simulation,  the  characteristics  and  techniques 
presented  here  are  completely  general.  Therefore,  if 
these  parameter  values  do  not  apply  to  a  specific  sit¬ 
uation  of  interest,  researchers  may  substitute  in  their 
own  ranges,  distributions,  matrix  sizes,  etc.,  and  the 
evaluation  software  of  this  study  will  still  apply. 

3.  Heuristic  Descriptions 

The definitions of the eleven static meta-task mapping
heuristics are provided below. First, some
preliminary terms must be defined. Machine availability
time, $mat(j)$, is the earliest time a machine $j$ can
complete the execution of all the tasks that have previously
been assigned to it. Completion time, $ct(i,j)$, is the
machine availability time plus the execution time of
task $i$ on machine $j$, i.e., $ct(i,j) = mat(j) + ETC(i,j)$.
The performance criterion used to compare the results
of the heuristics is the maximum value of $ct(i,j)$, for
$0 \le i < t$ and $0 \le j < m$, for each mapping, also known
as the makespan [19]. Each heuristic is attempting to
minimize the makespan (i.e., finish execution of the
meta-task as soon as possible).
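
To make the performance criterion concrete, here is a small Python sketch that evaluates the makespan of a given mapping. It assumes every machine is idle at time zero and represents a mapping as a list whose entry i is the machine assigned to task i; both the representation and the function name are illustrative choices. Under these assumptions the largest completion time equals the largest final machine availability time.

```python
def makespan(etc, mapping):
    """Return the makespan of a mapping.

    etc[i][j] is the expected time to compute task i on machine j;
    mapping[i] is the machine index chosen for task i."""
    mat = [0.0] * len(etc[0])  # machine availability times, initially idle
    for task, machine in enumerate(mapping):
        mat[machine] += etc[task][machine]  # ct(i, j) = mat(j) + ETC(i, j)
    return max(mat)

etc = [[3, 5],
       [2, 4],
       [6, 1]]
result = makespan(etc, [0, 0, 1])  # tasks 0 and 1 on machine 0, task 2 on machine 1
```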

The  descriptions  below  implicitly  assume  that  the 
machine  availability  times  are  updated  after  each  task 
is  mapped.  For  cases  when  tasks  can  be  considered  in 
an  arbitrary  order,  the  order  in  which  the  tasks  ap¬ 
peared  in  the  ETC  matrix  was  used.  Some  of  the 
heuristics  listed  below  had  to  be  modified  from  their 
original  implementation  to  better  handle  the  scenarios 
under  consideration. 

For  many  of  the  heuristics,  there  are  control  param¬ 
eter  values  and/or  control  function  specifications  that 
can  be  selected  for  a  given  implementation.  For  the 
studies  here,  such  values  and  specifications  were  se¬ 
lected  based  on  experimentation  and/or  information 
in the literature. These parameters and functions are
mentioned in Section 5.

OLB:  Opportunistic  Load  Balancing  (OLB)  as¬ 
signs  each  task,  in  arbitrary  order,  to  the  next  available 
machine,  regardless  of  the  task’s  expected  execution 
time  on  that  machine  [1,  9,  10]. 

UDA:  In  contrast  to  OLB,  User-Directed 
Assignment  (UDA)  assigns  each  task,  in  arbi¬ 
trary  order,  to  the  machine  with  the  best  expected 
execution time for that task, regardless of that
machine’s availability [1]. UDA is sometimes referred to
as Limited Best Assignment (LBA), as in [1, 9].

Fast Greedy: Fast Greedy assigns each task, in
arbitrary order, to the machine with the minimum
completion time for that task [1].
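
The three simple heuristics above differ only in the quantity each one minimizes when placing a task. The Python sketch below makes that contrast explicit. It is an illustration under assumptions: tasks are taken in row order, machines start idle, and ties go to the lowest machine index. None of these tie-breaking details are specified in the text, and the function names are hypothetical.

```python
def olb(etc):
    """Opportunistic Load Balancing: pick the machine that is available soonest."""
    m = len(etc[0])
    mat, mapping = [0.0] * m, []
    for row in etc:
        j = min(range(m), key=lambda k: mat[k])
        mapping.append(j)
        mat[j] += row[j]
    return mapping

def uda(etc):
    """User-Directed Assignment: pick the machine with the best execution time."""
    return [min(range(len(row)), key=lambda k: row[k]) for row in etc]

def fast_greedy(etc):
    """Fast Greedy: pick the machine with the minimum completion time."""
    m = len(etc[0])
    mat, mapping = [0.0] * m, []
    for row in etc:
        j = min(range(m), key=lambda k: mat[k] + row[k])
        mapping.append(j)
        mat[j] += row[j]
    return mapping

etc = [[3, 5],
       [2, 4],
       [6, 1]]
mappings = {"OLB": olb(etc), "UDA": uda(etc), "Fast Greedy": fast_greedy(etc)}
```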

Min-min: The Min-min heuristic begins with the
set $U$ of all unmapped tasks. Then, the set of
minimum completion times,
$M = \{m_i : m_i = \min_{0 \le j < m}(ct(i,j)), \text{ for each } i \in U\}$, is found. Next,
the task with the overall minimum completion time
from $M$ is selected and assigned to the corresponding
machine (hence the name Min-min). Lastly, the newly
mapped task is removed from $U$, and the process
repeats until all tasks are mapped (i.e., $U = \emptyset$) [1, 9, 15].

Intuitively,  Min-min  attempts  to  map  as  many  tasks 
as  possible  to  their  first  choice  of  machine  (on  the  basis 
of  completion  time),  under  the  assumption  that  this 
will  result  in  a  shorter  makespan.  Because  tasks  with 
shorter  execution  times  are  being  mapped  first,  it  was 
expected  that  the  percentage  of  tasks  that  receive  their 
first  choice  of  machine  would  generally  be  higher  for 
Min-min  than  for  Max-min  (defined  next),  and  this 
was  verified  by  data  recorded  during  the  simulations. 
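
A compact Python sketch of the Min-min loop follows. It is a hedged illustration rather than the implementation used in the study: machines are assumed idle at the start, and ties are broken by lowest machine index and then lowest task index, which the text leaves unspecified.

```python
def min_min(etc):
    """Return a mapping (list: task index -> machine index) built by Min-min."""
    t, m = len(etc), len(etc[0])
    mat = [0.0] * m            # machine availability times
    mapping = [None] * t
    unmapped = set(range(t))   # the set U
    while unmapped:
        # M: for each unmapped task, its minimum completion time and machine.
        best = {i: min((mat[j] + etc[i][j], j) for j in range(m))
                for i in unmapped}
        # Select the task whose minimum completion time is smallest overall.
        task = min(unmapped, key=lambda i: (best[i][0], i))
        completion, machine = best[task]
        mapping[task] = machine
        mat[machine] = completion  # update availability after each assignment
        unmapped.remove(task)
    return mapping

etc = [[3, 5],
       [2, 4],
       [6, 1]]
mapping = min_min(etc)
```

Recomputing the whole set M on every pass keeps the sketch short; a tuned implementation would only refresh the entries affected by the machine that just changed.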

Max-min: The Max-min heuristic is very similar
to Min-min. The Max-min heuristic also begins
with the set $U$ of all unmapped tasks. Then, the
set of minimum completion times,
$M = \{m_i : m_i = \min_{0 \le j < m}(ct(i,j)), \text{ for each } i \in U\}$, is found. Next,
the task with the overall maximum completion time
from $M$ is selected and assigned to the corresponding
machine (hence the name Max-min). Lastly, the newly
mapped task is removed from $U$, and the process
repeats until all tasks are mapped (i.e., $U = \emptyset$) [1, 9, 15].

The  motivation  behind  Max-min  is  to  attempt 
to  minimize  the  penalties  incurred  by  delaying  the 
scheduling  of  long-running  tasks.  Assume  that  the 
meta-task  being  mapped  has  several  tasks  with  short 
execution  times,  and  a  small  quantity  of  tasks  with 
very  long  execution  times.  Mapping  the  tasks  with  the 
longer  execution  times  to  their  best  machines  first  al¬ 
lows  these  tasks  to  be  executed  concurrently  with  the 
remaining  tasks  (with  shorter  execution  times).  This 
concurrent  execution  of  long  and  short  tasks  can  be 
more  beneficial  than  a  Min-min  mapping  where  all  of 
the shorter tasks would execute first, and then a few
longer  running  tasks  execute  while  several  machines 
sit  idle.  The  assumption  here  is  that  with  Max-min 
the  tasks  with  shorter  execution  times  can  be  mixed 
with  longer  tasks  and  evenly  distributed  among  the 
machines,  resulting  in  better  machine  utilization  and  a 
better  meta-task  makespan. 

Greedy:  The  Greedy  heuristic  is  literally  a  com¬ 
bination  of  the  Min-min  and  Max-min  heuristics.  The 
Greedy  heuristic  performs  both  of  the  Min-min  and 
Max-min  heuristics,  and  uses  the  better  solution  [1,  9]. 
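
Because Min-min and Max-min differ only in which element of M is selected, both can share one loop, and Greedy simply keeps the better of the two results. The Python sketch below illustrates this under assumptions (idle machines at the start, ties broken by lowest index); the helper names are hypothetical.

```python
def makespan(etc, mapping):
    """Largest machine availability time after all tasks are placed."""
    mat = [0.0] * len(etc[0])
    for task, machine in enumerate(mapping):
        mat[machine] += etc[task][machine]
    return max(mat)

def select_and_map(etc, choose):
    """Shared loop: `choose` is min for Min-min and max for Max-min."""
    t, m = len(etc), len(etc[0])
    mat, mapping = [0.0] * m, [None] * t
    unmapped = list(range(t))
    while unmapped:
        best = {i: min((mat[j] + etc[i][j], j) for j in range(m))
                for i in unmapped}
        task = choose(unmapped, key=lambda i: best[i][0])
        completion, machine = best[task]
        mapping[task] = machine
        mat[machine] = completion
        unmapped.remove(task)
    return mapping

def min_min(etc):
    return select_and_map(etc, min)

def max_min(etc):
    return select_and_map(etc, max)

def greedy(etc):
    """Run both heuristics and keep the mapping with the smaller makespan."""
    return min([min_min(etc), max_min(etc)], key=lambda mp: makespan(etc, mp))

# Two short tasks and one long task on two identical machines.
etc = [[2, 2],
       [2, 2],
       [4, 4]]
chosen = greedy(etc)
```

In this small example Min-min places the long task last and finishes at time 6, while Max-min places it first and finishes at time 4, so Greedy returns the Max-min mapping.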

GA: Genetic Algorithms (GAs) are a popular technique
used for searching large solution spaces (e.g.,
[25,  27]).  The  version  of  the  heuristic  used  for  this 
study  was  adapted  from  [27]  for  this  particular  solution 
space.  Figure  1  shows  the  steps  in  a  general  Genetic 
Algorithm. 

The  Genetic  Algorithm  implemented  here  operates 
on  a  population  of  200  chromosomes  (possible  map¬ 
pings)  for  a  given  meta-task.  Each  chromosome  is  a 
$t \times 1$ vector, where position $i$ ($0 \le i < t$) represents
task $i$, and the entry in position $i$ is the machine to
which the task has been mapped. The initial population
is generated using two methods: (a) 200 randomly
generated  chromosomes  from  a  uniform  distribution,  or 
(b)  one  chromosome  that  is  the  Min-min  solution  and 
199  random  solutions  (mappings).  The  latter  method 
is  called  seeding  the  population  with  a  Min-min  chro¬ 
mosome.  The  GA  actually  executes  eight  times  (four 
times  with  initial  populations  from  each  method),  and 
the  best  of  the  eight  mappings  is  used  as  the  final  so¬ 
lution. 

After  the  generation  of  the  initial  population,  all  of 
the  chromosomes  in  the  population  are  evaluated  (i.e., 
ranked)  based  on  their  fitness  value  (i.e.,  makespan), 
with  a  smaller  fitness  value  being  a  better  mapping. 
Then, the main loop in Figure 1 is entered and a rank-based
roulette wheel scheme [26] is used for selection.
This  scheme  probabilistically  generates  new  popula¬ 
tions,  where  better  mappings  have  a  higher  probability 
of  surviving  to  the  next  generation.  Elitism,  the  prop¬ 
erty  of  guaranteeing  the  best  solution  remains  in  the 
population  [20],  was  also  implemented. 

Next,  the  crossover  operation  selects  a  pair  of  chro¬ 
mosomes  and  chooses  a  random  point  in  the  first  chro¬ 
mosome.  For  the  sections  of  both  chromosomes  from 
that  point  to  the  end  of  each  chromosome,  crossover  ex¬ 
changes  machine  assignments  between  corresponding 
tasks.  Every  chromosome  is  considered  for  crossover 
with  a  probability  of  60%. 

After  crossover,  the  mutation  operation  is  per¬ 
formed.  Mutation  randomly  selects  a  task  within  the 
chromosome, and randomly reassigns it to a new
machine. Both of these random operations select values
from  a  uniform  distribution.  Every  chromosome  is  con¬ 
sidered  for  mutation  with  a  probability  of  40%. 
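
The crossover and mutation operators described above can be sketched as follows. This is an illustrative Python fragment, not the adapted code from [27]. A chromosome is assumed to be a list of machine indices, the crossover point is drawn over the whole chromosome, and mutation draws the replacement machine uniformly from all m machines, so it may occasionally pick the machine already assigned. Selection, elitism, and the 60% and 40% application probabilities are left out.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Single-point crossover: swap machine assignments from a random
    point to the end of both chromosomes."""
    point = rng.randrange(len(parent_a))
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(chromosome, m, rng):
    """Reassign one randomly chosen task to a randomly chosen machine."""
    child = list(chromosome)
    task = rng.randrange(len(child))
    child[task] = rng.randrange(m)
    return child

rng = random.Random(42)
a, b = [0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]
child_a, child_b = crossover(a, b, rng)
mutant = mutate(a, m=4, rng=rng)
```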

Finally, the chromosomes from this modified population
are evaluated again. This completes one iteration
of the GA. The GA stops when any one of three
conditions is met: (a) 1000 total iterations, (b) no
change in the elite chromosome for 150 iterations, or
(c) all chromosomes converge. If no stopping criterion is
met, the loop repeats, beginning with the selection of
a new population. The stopping criterion that usually
occurred in testing was no change in the elite
chromosome in 150 iterations.

SA: Simulated Annealing (SA) is an iterative technique
that considers only one possible solution (mapping)
for each meta-task at a time. This solution uses
the same representation as the chromosome
for the GA.

SA  uses  a  procedure  that  probabilistically  allows 
poorer  solutions  to  be  accepted  to  attempt  to  obtain 
a  better  search  of  the  solution  space  (e.g.,  [6,  17,  21]). 
This  probability  is  based  on  a  system  temperature  that 
decreases  for  each  iteration.  As  the  system  tempera¬ 
ture  “cools,”  it  is  more  difficult  for  currently  poorer 
solutions  to  be  accepted.  The  initial  system  tempera¬ 
ture  is  the  makespan  of  the  initial  mapping. 

The  specific  SA  procedure  implemented  here  is  as 
follows.  The  initial  mapping  is  generated  from  a  uni¬ 
form  random  distribution.  The  mapping  is  mutated  in 
the  same  manner  as  the  GA,  and  the  new  makespan 
is  evaluated.  The  decision  algorithm  for  accepting  or 
rejecting  the  new  mapping  is  based  on  [6].  If  the  new 
makespan  is  better,  the  new  mapping  replaces  the  old 
one.  If  the  new  makespan  is  worse  (larger),  a  uniform 
random number $z \in [0, 1)$ is selected. Then, $z$ is
compared with $y$, where

$$y = \frac{1}{1 + e^{\left(\frac{\text{old makespan} - \text{new makespan}}{\text{temperature}}\right)}}. \qquad (1)$$

If $z > y$ the new (poorer) mapping is accepted; otherwise
it is rejected, and the old mapping is kept.

Notice that for solutions with similar makespans (or
if the system temperature is very large), $y \to 0.5$, and
poorer solutions are more easily accepted. In contrast,
for solutions with very different makespans (or if the
system temperature is very small), $y \to 1$, and poorer
solutions will usually be rejected.
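
The acceptance rule can be checked numerically with a short Python sketch. The function names are illustrative, and the temperature is assumed to be strictly positive. A new makespan equal to the old one is sent through the probabilistic branch here, which is an assumption since the text only covers the better and worse cases.

```python
import math
import random

def threshold(old_makespan, new_makespan, temperature):
    """The value y from Equation (1)."""
    exponent = (old_makespan - new_makespan) / temperature
    return 1.0 / (1.0 + math.exp(exponent))

def accept(old_makespan, new_makespan, temperature, rng):
    """Decide whether the mutated mapping replaces the current one."""
    if new_makespan < old_makespan:
        return True                  # better mappings are always kept
    z = rng.random()                 # uniform in [0, 1)
    return z > threshold(old_makespan, new_makespan, temperature)

rng = random.Random(0)
y_similar = threshold(100.0, 101.0, 500.0)   # close to 0.5
y_far = threshold(100.0, 200.0, 1.0)         # close to 1
kept = accept(100.0, 90.0, 10.0, rng)
```

Since the exponent is never positive when the new makespan is not smaller, the exponential stays at or below 1 and cannot overflow in that branch.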

After  each  mutation,  the  system  temperature  is  de¬ 
creased  by  10%.  This  defines  one  iteration  of  SA.  The 
heuristic  stops  when  there  is  no  change  in  the  makespan 
for  150  iterations  or  the  system  temperature  reaches 
zero.  Most  tests  ended  with  no  change  in  the  makespan 
for 150 iterations.

GSA: The Genetic Simulated Annealing (GSA)
heuristic is a combination of the GA and SA techniques
[4, 23]. In general, GSA follows procedures similar to
the GA outlined above. GSA operates on a population
of 200 chromosomes, uses a Min-min seed in four
out of eight initial populations, and performs similar
mutation and crossover operations. However, for the
selection process, GSA uses the SA cooling schedule
and system temperature, and a simplified SA decision
process for accepting or rejecting a new chromosome.
GSA also used elitism to guarantee that the best
solution always remained in the population.

The  initial  system  temperature  for  the  GSA  selec¬ 
tion  process  was  set  to  the  average  makespan  of  the 
initial  population,  and  decreased  10%  for  each  itera¬ 
tion.  When  a  new  (post-mutation,  post-crossover,  or 
both)  chromosome  is  compared  with  the  corresponding 
original  chromosome,  if  the  new  makespan  is  less  than 
the  old  makespan  plus  the  system  temperature,  then 
the  new  chromosome  is  accepted.  That  is,  if 

$$\text{new makespan} < (\text{old makespan} + \text{temperature}) \qquad (2)$$

is  true,  the  new  chromosome  becomes  part  of  the  pop¬ 
ulation.  Otherwise,  the  original  chromosome  survives 
to  the  next  iteration.  Therefore,  as  the  system  tem¬ 
perature  decreases,  it  is  again  more  difficult  for  poorer 
solutions  (i.e.,  longer  makespans)  to  be  accepted.  The 
two  stopping  criteria  used  were  either  (a)  no  change  in 
the  elite  chromosome  in  150  iterations  or  (b)  1000  total 
iterations. Again, the most common stopping criterion
was no change in the elite chromosome in 150
iterations.
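
The simplified GSA selection test and its cooling schedule amount to very little code. The Python sketch below is illustrative only; the function names and the loop that prints the shrinking tolerance are assumptions made for the example.

```python
def gsa_accept(old_makespan, new_makespan, temperature):
    """Inequality (2): accept when the new makespan is less than the
    old makespan plus the current system temperature."""
    return new_makespan < old_makespan + temperature

def cool(temperature, rate=0.10):
    """Decrease the system temperature by 10% per iteration."""
    return temperature * (1.0 - rate)

# The same slightly worse chromosome is accepted early and rejected later.
temperature = 20.0
decisions = []
for _ in range(10):
    decisions.append(gsa_accept(100.0, 112.0, temperature))
    temperature = cool(temperature)
```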

Tabu:  Tabu  search  is  a  solution  space  search  that 
keeps  track  of  the  regions  of  the  solution  space  which 
have  already  been  searched  so  as  not  to  repeat  a  search 
near  these  areas  [7,  12].  A  solution  (mapping)  uses 
the  same  representation  as  a  chromosome  in  the  GA 
approach. 

The  implementation  of  Tabu  search  used  here  be¬ 
gins  with  a  random  mapping,  generated  from  a  uni¬ 
form  distribution.  Starting  with  the  first  task  in  the 
mapping, task $i = 0$, each possible pair of tasks is
formed, $(i, j)$ for $0 \le i < t - 1$ and $i < j < t$. As
each  pair  of  tasks  is  formed,  they  exchange  machine 
assignments.  This  constitutes  a  short  hop.  The  in¬ 
tuitive  purpose  of  a  short  hop  is  to  find  the  nearest 
local  minimum  solution  within  the  solution  space.  Af¬ 
ter  each  exchange,  the  new  makespan  is  evaluated.  If 
the  new  makespan  is  an  improvement,  the  new  map¬ 
ping  is  accepted  (a  successful  short  hop),  and  the  pair 
generation-and-exchange sequence starts over from the
beginning ($i = 0$) of the new mapping. Otherwise, the
pair generation-and-exchange sequence continues from
its previous state, $(i, j)$. New short hops are generated
until  1200  successful  short  hops  have  been  made  or  all 
combinations  of  task  pairs  have  been  exhausted  with 
no  further  improvement. 

At  this  point,  the  final  mapping  from  the  local  so¬ 
lution  space  search  is  added  to  the  tabu  list.  The  tabu 
list  is  a  method  of  keeping  track  of  the  regions  of  the 
solution  space  that  have  already  been  searched.  Next, 
a  new  random  mapping  is  generated,  and  it  must  differ 
from  each  mapping  in  the  tabu  list  by  at  least  half  of 
the  machine  assignments  (a  successful  long  hop).  The 
intuitive  purpose  of  a  long  hop  is  to  move  to  a  new 
region  of  the  solution  space  that  has  not  already  been 
searched.  The  final  stopping  criterion  for  the  heuristic 
is  a  total  of  1200  successful  hops  (short  and  long  com¬ 
bined).  Then,  the  best  mapping  from  the  tabu  list  is 
the  final  answer. 
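
The short-hop search and the long-hop distance test can be sketched in Python as follows. This is a simplified illustration under assumptions: machines start idle, an unsuccessful exchange is undone before the next pair is tried, and the names short_hop_search and far_enough are hypothetical. Generating the random restart mappings and maintaining the overall hop budget are omitted.

```python
def makespan(etc, mapping):
    mat = [0.0] * len(etc[0])
    for task, machine in enumerate(mapping):
        mat[machine] += etc[task][machine]
    return max(mat)

def short_hop_search(etc, mapping, max_hops):
    """Exchange machine assignments of task pairs (i, j), i < j, restarting
    from the first pair after every improvement."""
    mapping = list(mapping)
    best = makespan(etc, mapping)
    t, hops = len(mapping), 0
    i, j = 0, 1
    while hops < max_hops and i < t - 1:
        mapping[i], mapping[j] = mapping[j], mapping[i]
        candidate = makespan(etc, mapping)
        if candidate < best:          # successful short hop
            best, hops = candidate, hops + 1
            i, j = 0, 1               # start over from the beginning
            continue
        mapping[i], mapping[j] = mapping[j], mapping[i]  # undo the exchange
        j += 1
        if j == t:                    # advance to the next first task
            i, j = i + 1, i + 2
    return mapping, best, hops

def far_enough(candidate, tabu_list):
    """Long-hop test: differ from every tabu mapping in at least half
    of the machine assignments."""
    half = len(candidate) / 2
    return all(sum(a != b for a, b in zip(candidate, old)) >= half
               for old in tabu_list)

etc = [[1, 10],
       [10, 1]]
local_best = short_hop_search(etc, [1, 0], max_hops=1200)
```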

A*:  The  final  heuristic  in  the  comparison  study  is 
known as the A* heuristic. A* has been applied to
many  other  task  allocation  problems  (e.g.,  [5,  16,  21, 
22]).  The  technique  used  here  is  similar  to  [5]. 

A*  is  a  tree  search  beginning  at  a  root  node  that  is 
usually  a  null  solution.  As  the  tree  grows,  intermediate 
nodes  represent  partial  solutions  (a  subset  of  tasks  are 
assigned  to  machines),  and  leaf  nodes  represent  final 
solutions  (all  tasks  are  assigned  to  machines).  The  par¬ 
tial  solution  of  a  child  node  has  one  more  task  mapped 
than  the  parent  node.  Call  this  additional  task  a.  Each 
parent  node  generates  m  children,  one  for  each  possi¬ 
ble  mapping  of  a.  After  a  parent  node  has  done  this, 
the  parent  node  is  removed  and  replaced  in  the  tree  by 
the  m  children.  Based  on  experimentation  and  a  desire 
to  keep  execution  time  of  the  heuristic  tractable,  the 
maximum  number  of  nodes  in  the  tree  at  any  one  time 
is limited in this study to $n_{max} = 1024$.

Each node, $n$, has a cost function, $f(n)$, associated
with it. The cost function is an estimated lower-bound
on the makespan of the best solution that includes the
partial solution represented by node $n$. Let $g(n)$
represent the makespan of the task/machine assignments in
the partial solution of node $n$, i.e., $g(n)$ is the maximum
of the machine availability times ($mat(j)$) based on the
set of tasks that have been mapped to machines in node
$n$’s partial solution. Let $h(n)$ be a lower-bound
estimate on the difference between the makespan of node
$n$’s partial solution and the makespan for the best
complete solution that includes node $n$’s partial solution.
Then, the cost function for node $n$ is computed as

$$f(n) = g(n) + h(n). \qquad (3)$$

Therefore, $f(n)$ represents the makespan of the partial
solution of node $n$ plus a lower-bound estimate of the
time to execute the rest of the (unmapped) tasks in the
meta-task.

The function $h(n)$ is defined in terms of two
functions, $h_1(n)$ and $h_2(n)$, which are two different
approaches to deriving a lower-bound estimate. Recall
that $M = \{m_i : m_i = \min_{0 \le j < m}(ct(i,j)), \text{ for each } i \in U\}$.
For node $n$ let $mmct(n)$ be the overall maximum
element of $M$ over all $i \in U$ (i.e., “the maximum
minimum completion time”). Intuitively, $mmct(n)$
represents the best possible meta-task makespan by making
the typically unrealistic assumption that each task in $U$
can be assigned to the machine indicated in $M$ without
conflict. Thus, based on [5], $h_1(n)$ is defined as

$$h_1(n) = \max(0, (mmct(n) - g(n))). \qquad (4)$$

Next, let $sdma(n)$ be the sum of the differences
between $g(n)$ and each machine availability time over all
machines after executing all of the tasks in the partial
solution represented by node $n$:

$$sdma(n) = \sum_{j=0}^{m-1} (g(n) - mat(j)). \qquad (5)$$

Intuitively, $sdma(n)$ represents the amount of machine
availability time remaining that can be scheduled
without increasing the final makespan. Let $smet(n)$ be
defined as the sum of the minimum expected execution
times (i.e., ETC values) for all tasks in $U$:

$$smet(n) = \sum_{i \in U} \left( \min_{0 \le j < m} (ETC(i,j)) \right). \qquad (6)$$

This gives an estimate of the amount of remaining work
to do, which could increase the final makespan. The
function $h_2$ is then defined as

$$h_2(n) = \max(0, (smet(n) - sdma(n))/m), \qquad (7)$$

where $(smet(n) - sdma(n))/m$ represents an estimate
of the minimum increase in the meta-task makespan
if the tasks in $U$ could be “ideally” (but, in general,
unrealistically) distributed among the machines. Using
these definitions,

$$h(n) = \max(h_1(n), h_2(n)), \qquad (8)$$

representing a lower-bound estimate on the time to
execute the tasks in $U$.
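
Equations (3) through (8) can be evaluated for a single node with the Python sketch below. It is a hedged illustration: a node is assumed to be described by its machine availability times and its list of unmapped tasks, and the function name node_cost is hypothetical. Tree expansion and pruning are not shown.

```python
def node_cost(etc, mat, unmapped):
    """Return (f, g, h) for a partial solution.

    mat[j] is the availability time of machine j after the tasks mapped
    so far; unmapped lists the task indices still in U."""
    m = len(mat)
    g = max(mat)                                   # makespan of the partial solution
    if not unmapped:
        return g, g, 0.0                           # leaf node: nothing left to add
    # h1: the maximum minimum completion time, relative to g (Equation 4).
    mmct = max(min(mat[j] + etc[i][j] for j in range(m)) for i in unmapped)
    h1 = max(0.0, mmct - g)
    # h2: remaining work minus idle capacity, spread over m machines (Equation 7).
    sdma = sum(g - mat[j] for j in range(m))       # Equation 5
    smet = sum(min(etc[i]) for i in unmapped)      # Equation 6
    h2 = max(0.0, (smet - sdma) / m)
    h = max(h1, h2)                                # Equation 8
    return g + h, g, h                             # Equation 3

etc = [[3, 5],
       [2, 4],
       [6, 1]]
# Task 0 has been mapped to machine 0; tasks 1 and 2 remain.
f, g, h = node_cost(etc, mat=[3.0, 0.0], unmapped=[1, 2])
```

For this node, g is 3, mmct is 4 (so h1 is 1), sdma and smet are both 3 (so h2 is 0), and the resulting estimate f is 4.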

Thus, beginning with the root, the node with the
minimum $f(n)$ is replaced by its $m$ children, until $n_{max}$
nodes are created. From that point on, any time a node
is added, the tree is pruned by deleting the node with
the largest $f(n)$. This process continues until a leaf
node (representing a complete mapping) is reached.


Note  that  if  the  tree  is  not  pruned,  this  method  is 
equivalent  to  an  exhaustive  search. 

These  eleven  heuristics  were  all  implemented  under 
the  common  simulation  model  described  in  Section  2. 
The  results  from  experiments  using  these  implemen¬ 
tations  are  described  in  the  next  section.  Suggestions 
for  alternative  heuristic  implementations  are  given  in 
Section  5. 

4.  Experimental  Results 

An  interactive  software  application  has  been  devel¬ 
oped  that  allows  simulation,  testing,  and  demonstra¬ 
tion  of  the  heuristics  examined  in  Section  3  applied  to 
the  meta-tasks  defined  by  the  ETC  matrices  described 
in  Section  2.  The  software  allows  a  user  to  specify 
t  and  m,  to  select  which  ETC  matrices  to  use,  and 
to  choose  which  heuristics  to  execute.  It  then  gener¬ 
ates  the  specified  ETC  matrices,  executes  the  desired 
heuristics,  and  displays  the  results,  similar  to  Figures  2 
through  13.  The  results  discussed  in  this  section  were 
generated  using  portions  of  this  software. 

When  comparing  mapping  heuristics,  the  execu¬ 
tion  time  of  the  heuristics  themselves  is  an  impor¬ 
tant  consideration.  For  the  heuristics  listed,  the  ex¬ 
ecution  times  varied  greatly.  The  experimental  results 
discussed  below  were  obtained  on  a  Pentium  II  400 
MHz  processor  with  1GB  of  RAM.  Each  of  the  sim¬ 
pler  heuristics  (OLB,  UDA,  Fast  Greedy,  and  Greedy) 
executed  in  a  few  seconds  for  one  ETC  matrix  with 
t  =  512  and  m  =  16.  For  the  same  sized  ETC  ma¬ 
trix,  SA  and  Tabu,  both  of  which  manipulate  a  single 
solution  during  an  iteration,  averaged  less  than  30  sec¬ 
onds.  GA  and  GSA  required  approximately  60  seconds 
per  matrix  because  they  manipulate  entire  populations, 
and  A*  required  about  20  minutes  per  matrix. 

The  resulting  meta-task  execution  times 
(makespans)  from  the  simulations  for  every  case 
of  consistency,  task  heterogeneity,  and  machine  het¬ 
erogeneity  are  shown  in  Figures  2  through  13.  All 
experimental  results  represent  the  execution  time  of 
a  meta-task  (defined  by  a  particular  ETC  matrix) 
based  on  the  mapping  found  by  the  heuristic  specified, 
averaged  over  100  different  ETC  matrices  of  the  same 
type  (i.e.,  100  mappings).  For  each  heuristic,  the  range 
bars  show  the  minimum  and  maximum  meta-task 
execution  times  over  the  100  mappings  (100  ETC 
matrices)  used  to  compute  the  average  meta-task 
execution  time.  Tables  1  through  4  show  sample 
subsections  from  the  four  types  of  inconsistent  ETC 
matrices  considered.  Semi-consistent  and  consistent 
matrices  of  the  same  types  could  be  generated  from 
these  matrices  as  described  in  Section  2.  For  the 
results described here, however, entirely new matrices
were  generated  for  each  case. 

For  the  four  consistent  cases,  Figures  2  through  5, 
the  UDA  algorithm  had  the  worst  execution  times  by 
an  order  of  magnitude.  This  is  easy  to  explain.  For  the 
consistent  cases,  all  tasks  will  have  the  lowest  execu¬ 
tion  time  on  one  machine,  and  all  tasks  will  be  mapped 
to  this  particular  machine.  This  corresponds  to  results 
found  in  [1].  Because  of  this  poor  performance,  the 
UDA  results  were  not  included  in  Figures  2  through 
5.  OLB,  Max-min,  and  SA  had  the  next  poorest  re¬ 
sults.  GA  performed  the  best  for  the  consistent  cases. 
This  was  due  in  part  to  the  good  performance  of  the 
Min-min  heuristic.  The  best  GA  solution  always  came 
from  one  of  the  populations  that  had  been  seeded  with 
the  Min-min  solution.  As  is  apparent  in  the  figures, 
Min-min  performed  very  well  on  its  own,  giving  the 
second  best  results.  The  mutation,  crossover,  and  se¬ 
lection  operations  of  the  GA  were  always  able  to  im¬ 
prove  on  this  solution,  however.  GSA,  which  also  used 
a  Min-min  seed,  did  not  always  improve  upon  the  Min- 
min  solution.  Because  of  the  probabilistic  procedure 
used  during  selection,  GSA  would  sometimes  accept 
poorer  intermediate  solutions.  These  poorer  interme¬ 
diate  solutions  never  led  to  better  final  solutions,  thus 
GSA  gave  the  third  best  results.  The  performance  of 
A* was hindered because the estimates made by $h_1(n)$
and $h_2(n)$ are not as accurate for consistent cases as
they are for inconsistent and semi-consistent cases. For
consistent cases, $h_1(n)$ underestimates the competition
for machines and $h_2(n)$ underestimates the “workload”
distributed to each machine.

These  results  suggest  that  if  the  best  overall  solu¬ 
tion  is  desired,  the  GA  should  be  employed.  However, 
the  improvement  of  the  GA  solution  over  the  Min-min 
solution was never more than 10%. Therefore, the Min-min
heuristic may be more appropriate in certain situations,
given the difference in execution times of the
two heuristics.

For  the  four  inconsistent  test  cases  in  Figures  6 
through  9,  UDA  performs  much  better  and  the  perfor¬ 
mance  of  OLB  degrades.  Because  there  is  no  pattern 
to  the  consistency,  OLB  will  assign  more  tasks  to  poor 
or  even  worst-case  machines,  resulting  in  poorer  sched¬ 
ules.  In  contrast,  UDA  improves  because  the  “best” 
machines  are  distributed  across  the  set  of  machines, 
thus  task  assignments  will  be  more  evenly  distributed 
among  the  set  of  machines  avoiding  load  imbalance. 
Similarly,  Fast  Greedy  and  Min-min  performed  very 
well,  and  slightly  outperformed  UDA,  because  the  ma¬ 
chines  providing  the  best  task  completion  times  are 
more  evenly  distributed  among  the  set  of  machines. 
Min-min was also better than Max-min for all of the
inconsistent cases. The advantages Min-min gains by
mapping “best case” tasks first outweigh the savings
in penalties Max-min has by mapping “worst case”
tasks first.

Tabu  gave  the  second  poorest  results  for  the  in¬ 
consistent  cases,  at  least  16%  poorer  than  the  other 
heuristics.  Inconsistent  matrices  generated  more  suc¬ 
cessful  short  hops  than  the  associated  consistent  matri¬ 
ces.  Therefore,  fewer  long  hops  were  generated  and  less 
of  the  solution  space  was  searched,  resulting  in  poorer 
solutions.  The  increased  number  of  successful  short 
hops  for  inconsistent  matrices  can  be  explained  as  fol¬ 
lows.  The  pairwise  comparison  procedure  used  by  the 
short  hop  procedure  will  assign  machines  with  better 
performance  first,  early  in  the  search  procedure.  For 
the  consistent  cases,  these  machines  will  always  be  from 
the  same  set  of  machines.  For  inconsistent  cases,  these 
machines  could  be  any  machine.  Thus,  for  consistent 
cases,  the  search  becomes  somewhat  ordered,  and  the 
successful  short  hops  get  exhausted  faster.  For  incon¬ 
sistent  cases,  the  lack  of  order  means  there  are  more 
successful  short  hops,  resulting  in  fewer  long  hops. 

GA  and  A*  had  the  best  average  makespans,  and 
were  usually  within  a  small  constant  factor  of  each 
other.  The  random  approach  employed  by  these  meth¬ 
ods  was  useful  and  helped  overcome  the  difficulty  of 
locating  good  mappings  within  inconsistent  matrices. 
GA  again  benefited  from  having  the  Min-min  ini¬ 
tial  mapping.  A*  did  well  because  if  the  tasks  get 
more  evenly  distributed  among  the  machines,  this  more 
closely matches the lower-bound estimates of $h_1(n)$ and
$h_2(n)$.

Finally,  consider  the  semi-consistent  cases  in  Figures 
10  through  13.  For  semi-consistent  cases  with  high  ma¬ 
chine  heterogeneity,  the  UDA  heuristic  again  gave  the 
worst  results.  Intuitively,  UDA  is  suffering  from  the 
same  problem  as  in  the  consistent  cases:  half  of  all 
tasks  are  getting  assigned  to  the  same  machine.  OLB 
does  poorly  for  high  machine  heterogeneity  cases  be¬ 
cause  worst  case  matchings  will  have  higher  execution 
times  for  high  machine  heterogeneity.  For  low  ma¬ 
chine  heterogeneity,  the  worst  case  matchings  have  a 
much  lower  penalty.  The  best  heuristics  for  the  semi- 
consistent  cases  were  Min-min  and  GA.  This  is  not  sur¬ 
prising  because  these  were  two  of  the  best  heuristics 
from  the  consistent  and  inconsistent  tests,  and  semi- 
consistent  matrices  are  a  combination  of  consistent  and 
inconsistent  matrices.  Min-min  was  able  to  do  well  be¬ 
cause  it  searched  the  entire  row  for  each  task  and  as¬ 
signed  a  high  percentage  of  tasks  to  their  first  choice. 
GA  was  robust  enough  to  handle  the  consistent  compo¬ 
nents  of  the  matrices,  and  did  well  for  the  same  reasons 
mentioned  for  inconsistent  matrices. 
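The row search attributed to Min-min above can be illustrated with a minimal sketch of the commonly described Min-min rule. It assumes an ETC matrix `etc[i][j]` (rows are tasks, columns are machines) and machines that are all idle at time zero; the function name `min_min` and the toy matrix in the usage note are illustrative and are not taken from the study.

```python
def min_min(etc):
    """Map every task with the Min-min rule; return (mapping, makespan).

    etc[i][j] is the estimated time to compute task i on machine j.
    """
    num_machines = len(etc[0])
    ready = [0.0] * num_machines          # time at which each machine becomes free
    unmapped = set(range(len(etc)))
    mapping = {}
    while unmapped:
        best = None                       # (completion time, task, machine)
        for i in unmapped:
            # First "min": best machine for task i given current ready times.
            j = min(range(num_machines), key=lambda m: ready[m] + etc[i][m])
            candidate = (ready[j] + etc[i][j], i, j)
            # Second "min": keep the task whose best completion time is smallest.
            if best is None or candidate < best:
                best = candidate
        finish, i, j = best
        mapping[i] = j
        ready[j] = finish
        unmapped.remove(i)
    return mapping, max(ready)
```

For example, `min_min([[3, 5], [2, 4], [6, 1]])` maps tasks 0 and 1 to machine 0 and task 2 to machine 1, giving a makespan of 5.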
5. Alternative Implementations

The  experimental  results  in  Section  4  show  the  per¬ 
formance  of  each  heuristic  under  the  assumptions  pre¬ 
sented.  For  several  heuristics,  specific  control  param¬ 
eter  values  and  control  functions  had  to  be  selected. 
In  most  cases,  control  parameter  values  and  control 
functions  were  based  on  the  references  cited  or  experi¬ 
ments  conducted.  However,  for  these  heuristics,  differ¬ 
ent,  valid  implementations  are  possible  using  different 
control  parameters  and  control  functions. 

GA,  SA,  GSA:  Several  parameter  values  could 
be  varied  among  these  techniques,  including  (where  ap¬ 
propriate)  population  size,  crossover  probability,  mu¬ 
tation  probability,  stopping  criteria,  number  of  runs 
with  different  initial  populations  per  result,  and  the 
system  temperature.  The  specific  procedures  used  for 
the  following  actions  could  also  be  modified  (where  ap¬ 
propriate)  including  initial  population  “seed”  genera¬ 
tion,  mutation,  crossover,  selection,  elitism,  and  the 
accept/reject  new  mapping  procedure. 

Tabu:  The  short  hop  method  implemented  was  a 
“first  descent”  (take  the  first  improvement  possible) 
method.  “Steepest  descent”  methods  (where  several 
short  hops  are  considered  simultaneously,  and  the  one 
with  the  most  improvement  is  selected)  are  also  used 
in  practice  [7].  Other  techniques  that  could  be  var¬ 
ied  are  the  long  hop  method,  the  order  of  the  short 
hop pair generation-and-exchange sequence, and the
stopping  condition.  Two  possible  alternative  stopping 
criteria  are  when  the  tabu  list  reaches  a  specified  num¬ 
ber  of  entries,  or  when  there  is  no  change  in  the  best 
solution  in  a  specified  number  of  hops. 
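The difference between the two short hop styles can be sketched as follows. This is a simplified illustration, not the Tabu implementation used in the study: it assumes a short hop moves a single task to another machine (the study's pairwise procedure is more involved), and the names `neighbours`, `first_descent`, and `steepest_descent` are hypothetical.

```python
def makespan(etc, mapping):
    """Largest total load on any machine for mapping[i] = machine of task i."""
    loads = [0.0] * len(etc[0])
    for task, machine in enumerate(mapping):
        loads[machine] += etc[task][machine]
    return max(loads)


def neighbours(etc, mapping):
    """Yield every mapping that differs from `mapping` in one task's machine."""
    for task in range(len(mapping)):
        for machine in range(len(etc[0])):
            if machine != mapping[task]:
                yield mapping[:task] + [machine] + mapping[task + 1:]


def first_descent(etc, mapping):
    """Return the first improving neighbour found, or None."""
    current = makespan(etc, mapping)
    for candidate in neighbours(etc, mapping):
        if makespan(etc, candidate) < current:
            return candidate
    return None


def steepest_descent(etc, mapping):
    """Return the neighbour with the largest improvement, or None."""
    best = min(neighbours(etc, mapping), key=lambda c: makespan(etc, c))
    return best if makespan(etc, best) < makespan(etc, mapping) else None
```

With `etc = [[2, 5], [3, 1], [4, 1]]` and the starting mapping `[0, 0, 0]`, first descent accepts `[1, 0, 0]` (makespan 7), while steepest descent examines all neighbours and picks `[0, 0, 1]` (makespan 5).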

A*:  Several  variations  of  the  A*  method  that  was 
employed here could be implemented. Different functions
could be used to estimate the lower bound.
The  maximum  size  of  the  search  tree  could  be  varied, 
and  several  other  techniques  exist  for  tree  pruning  (e.g., 
[21]). 

In  summary,  for  the  GA,  SA,  GSA,  Tabu,  and  A* 
heuristics  there  are  a  great  number  of  possible  valid 
implementations.  An  attempt  was  made  to  use  a  rea¬ 
sonable  implementation  of  each  heuristic  for  this  study. 
Future  work  could  examine  other  implementations. 

6.  Conclusions 

The  goal  of  this  study  was  to  provide  a  basis  for 
comparison  and  insights  into  circumstances  where  one 
technique will outperform another for eleven different
heuristics. The characteristics of the ETC matrices
used as input for the heuristics and the methods used to
generate them were specified. The implementation of
a  collection  of  eleven  heuristics  from  the  literature  was 
described.  The  results  of  the  mapping  heuristics  were 
discussed,  revealing  the  best  heuristics  to  use  in  certain 
scenarios.  For  the  situations,  implementations,  and  pa¬ 
rameter  values  used  here,  GA  was  the  best  heuristic  for 
most  cases,  followed  closely  by  Min-min,  with  A*  also 
doing  well  for  inconsistent  matrices. 

A  software  tool  was  developed  that  allows  others 
to  compare  these  heuristics  for  many  different  types 
of  ETC  matrices.  These  heuristics  could  also  be  the 
basis  of  a  mapping  toolkit.  If  this  toolkit  were  given  an 
ETC  matrix  representing  an  actual  meta-task  and  an 
actual  HC  environment,  the  toolkit  could  analyze  the 
ETC  matrix,  and  utilize  the  best  mapping  heuristic 
for  that  scenario.  Depending  on  the  overall  situation, 
the  execution  time  of  the  mapping  heuristic  itself  may 
impact  this  decision.  For  example,  if  the  best  mapping 
available  in  less  than  one  minute  was  desired  and  if 
the  characteristics  of  a  given  ETC  matrix  most  closely 
matched  a  consistent  matrix,  Min-min  would  be  used; 
if  more  time  was  available  for  finding  the  best  mapping, 
GA  and  A*  should  be  considered. 

The  comparisons  of  the  eleven  heuristics  and  twelve 
situations  provided  in  this  study  can  be  used  by  re¬ 
searchers  as  a  starting  point  when  choosing  heuristics 
to  apply  in  different  scenarios.  They  can  also  be  used 
by  researchers  for  selecting  heuristics  to  compare  new, 
developing  techniques  against. 
initial population generation;
evaluation;
while (stopping criteria not met) {
    selection;
    crossover;
    mutation;
    evaluation;
}

Figure 1. General procedure for a Genetic Algorithm, based on [26].
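The procedure in Figure 1 can be made concrete with a small runnable sketch. The operators here (binary tournament selection, single-point crossover, single-gene mutation, and elitism), the default parameter values, and the fixed generation count used as the stopping criterion are all illustrative assumptions, not the settings of the GA evaluated in the study. A chromosome is assumed to be a list whose i-th entry is the machine assigned to task i.

```python
import random


def makespan(etc, chromosome):
    """Fitness: largest machine load implied by chromosome[i] = machine of task i."""
    loads = [0.0] * len(etc[0])
    for task, machine in enumerate(chromosome):
        loads[machine] += etc[task][machine]
    return max(loads)


def genetic_algorithm(etc, seed_mapping=None, pop_size=20, generations=100,
                      p_crossover=0.6, p_mutation=0.4, rng_seed=0):
    rng = random.Random(rng_seed)
    tasks, machines = len(etc), len(etc[0])
    # Initial population generation (optionally seeded with a known mapping).
    population = [[rng.randrange(machines) for _ in range(tasks)]
                  for _ in range(pop_size)]
    if seed_mapping is not None:
        population[0] = list(seed_mapping)
    # Evaluation.
    best = min(population, key=lambda c: makespan(etc, c))
    for _ in range(generations):          # stopping criterion: fixed generation count
        # Selection: binary tournament.
        parents = [min(rng.sample(population, 2), key=lambda c: makespan(etc, c))
                   for _ in range(pop_size)]
        # Crossover: single cut point between consecutive parents.
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            if rng.random() < p_crossover and tasks > 1:
                cut = rng.randrange(1, tasks)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            children += [list(a), list(b)]
        # Mutation: reassign one randomly chosen task.
        for child in children:
            if rng.random() < p_mutation:
                child[rng.randrange(tasks)] = rng.randrange(machines)
        # Evaluation with elitism: the best mapping so far always survives.
        children[0] = list(best)
        population = children
        best = min(population, key=lambda c: makespan(etc, c))
    return best, makespan(etc, best)
```

Because the seed mapping enters the initial population and the best chromosome is always carried forward, the returned makespan in this sketch can never be worse than that of the seed.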
Figure 2. Consistent, high task, high machine heterogeneity.

Figure 4. Consistent, low task, high machine heterogeneity.

Figure 5. Consistent, low task, low machine heterogeneity.

Figure 9. Inconsistent, low task, low machine heterogeneity.

Figure 12. Semi-consistent, low task, high machine heterogeneity.

Figure 13. Semi-consistent, low task, low machine heterogeneity.
(rows: tasks; columns: machines)

  436,735.9    815,309.1    891,469.0  1,722,197.6  1,340,988.1    740,028.0  1,749,673.7    251,140.1
  950,470.7    933,830.1  2,156,144.2  2,202,018.0  2,286,210.0  2,779,669.0    220,536.3  1,769,184.5
  453,126.6    479,091.9    150,324.5    386,338.1    401,682.9    218,826.0    242,699.6     11,392.2
1,289,078.2  1,400,308.1  2,378,363.0  2,458,087.0    351,387.4    925,070.1  2,097,914.2  1,206,158.2
  646,129.6    576,144.9  1,475,908.2    424,448.8    576,238.7    223,453.8    256,804.5     88,737.9
1,061,682.3     43,439.8  1,355,855.5  1,736,937.1  1,624,942.6  2,070,705.1  1,977,650.2  1,066,470.8
   10,783.8      7,453.0      3,454.4     23,720.8     29,817.3      1,143.7     44,249.2      5,039.5
1,940,704.5  1,682,338.5  1,978,545.6    788,342.1  1,192,052.5  1,022,914.1    701,336.3  1,052,728.3

Table 1. Sample 8x8 excerpt from ETC with inconsistent, high task, high machine heterogeneity.


(rows: tasks; columns: machines)

21,612.6  13,909.7   6,904.1   3,621.5   3,289.5   8,752.0   5,053.7  14,515.3
   578.4     681.1     647.9     477.1     811.9     619.5     490.9     828.7
   122.8     236.9      61.3     143.6      56.0     313.4     283.5     241.9
 1,785.7   1,528.1   6,998.8   4,265.3   3,174.6   3,438.0   7,168.4   2,059.3
   510.8     472.0     358.5     461.4   1,898.7   1,535.4   1,810.2     906.6
22,916.7  18,510.0  11,932.7   6,088.3   9,239.7  15,036.4  18,107.7  12,262.6
 5,985.3   2,006.5   1,546.4   6,444.6   2,640.0   7,389.3   5,924.9   1,867.2
16,192.4   3,088.9  16,532.5  13,160.6  10,574.2   7,136.3  15,353.4   2,150.6

Table 2. Sample 8x8 excerpt from ETC with inconsistent, high task, low machine heterogeneity.


(rows: tasks; columns: machines)

16,603.2  71,369.1  39,849.0  44,566.1  55,124.3   9,077.3  87,594.5  31,530.5
   738.3   2,375.0   5,606.2     804.9   1,535.8   4,772.3     994.2   1,833.9
 1,513.8      45.1   1,027.3   2,962.1   2,748.2   2,406.3      19.4     969.9
 2,219.9   5,989.2   2,747.0      88.2   2,055.1     665.0     356.3   2,404.9
12,654.7  10,483.7  10,601.5   6,804.6     134.3  10,532.8  12,341.5   5,046.3
 4,226.0  48,152.2  11,279.3  35,471.1  30,723.4  24,234.0   6,366.9  22,926.9
20,668.5  28,875.9  29,610.1   7,363.3  24,488.0  31,077.3   8,705.0  11,849.4
52,953.2  14,608.1  58,137.2  16,685.5  36,571.3  35,888.8  38,147.0  15,167.5

Table 3. Sample 8x8 excerpt from ETC with inconsistent, low task, high machine heterogeneity.
Table 4. Sample 8x8 excerpt from ETC with inconsistent, low task, low machine heterogeneity.
Abstract

Dynamic  mapping  (matching  and  scheduling)  heuristics 
for  a  class  of  independent  tasks  using  heterogeneous  dis¬ 
tributed  computing  systems  are  studied.  Two  types  of  map¬ 
ping  heuristics  are  considered:  on-line  and  batch  mode 
heuristics.  Three  new  heuristics,  one  for  batch  and  two  for 
on-line,  are  introduced  as  part  of  this  research.  Simula¬ 
tion  studies  are  performed  to  compare  these  heuristics  with 
some  existing  ones.  In  total,  five  on-line  heuristics  and  three 
batch  heuristics  are  examined.  The  on-line  heuristics  con¬ 
sider,  to  varying  degrees  and  in  different  ways,  task  affinity 
for  different  machines  and  machine  ready  times.  The  batch 
heuristics  consider  these  factors,  as  well  as  aging  of  tasks 
waiting  to  execute.  The  simulation  results  reveal  that  the 
choice  of  mapping  heuristic  depends  on  parameters  such 
as:  (a)  the  structure  of  the  heterogeneity  among  tasks  and 
machines, (b) the optimization requirements, and (c) the arrival
rate of the tasks.

1.  Introduction 

An  emerging  trend  in  computing  is  to  use  distributed 
heterogeneous  computing  (HC)  systems  constructed  by  net¬ 
working  various  machines  to  execute  a  set  of  tasks  [5,  14]. 
These  HC  systems  have  resource  management  systems 
(RMSs)  to  govern  the  execution  of  the  tasks  that  arrive  for 
service.  This  paper  describes  and  compares  eight  heuris¬ 
tics  that  can  be  used  in  such  an  RMS  for  assigning  tasks  to 
machines. 

This research was supported by the DARPA/ITO Quorum Program under
the NPS subcontract numbers N62271-97-M-0900, N62271-98-M-0217,
and N62271-98-M-0448. Some of the equipment used was donated by
Intel.

In a general HC system, dynamic schemes are necessary
to assign tasks to machines (matching), and to compute
the execution order of the tasks assigned to each machine
(scheduling) [3]. In the HC system considered here,
the  tasks  are  assumed  to  be  independent,  i.e.,  no  communi¬ 
cations  between  the  tasks  are  needed.  A  dynamic  scheme  is 
needed  because  the  arrival  times  of  the  tasks  may  be  random 
and  some  machines  in  the  suite  may  go  off-line  and  new  ma¬ 
chines  may  come  on-line.  The  dynamic  mapping  (matching 
and  scheduling)  heuristics  investigated  in  this  study  are  non- 
preemptive,  and  assume  that  the  tasks  have  no  deadlines  or 
priorities  associated  with  them. 

The  mapping  heuristics  can  be  grouped  into  two  cate¬ 
gories:  on-line  mode  and  batch-mode  heuristics.  In  the 
on-line  mode,  a  task  is  mapped  onto  a  machine  as  soon 
as  it  arrives  at  the  mapper.  In  the  batch  mode,  tasks  are 
not  mapped  onto  the  machines  as  they  arrive;  instead  they 
are  collected  into  a  set  that  is  examined  for  mapping  at 
prescheduled  times  called  mapping  events.  The  indepen¬ 
dent  set  of  tasks  that  is  considered  for  mapping  at  the  map¬ 
ping  events  is  called  a  meta-task.  A  meta-task  can  include 
newly  arrived  tasks  (i.e.,  the  ones  arriving  after  the  last 
mapping  event)  and  the  ones  that  were  mapped  in  earlier 
mapping  events  but  did  not  begin  execution.  While  on¬ 
line  mode  heuristics  consider  a  task  for  mapping  only  once, 
batch  mode  heuristics  consider  a  task  for  mapping  at  each 
mapping  event  until  the  task  begins  execution. 

The  trade-offs  between  on-line  and  batch  mode  heuris¬ 
tics  are  studied  experimentally.  Mapping  independent  tasks 
onto  an  HC  suite  is  a  well-known  NP-complete  problem  if 
throughput  is  the  optimization  criterion  [9].  For  the  heuris¬ 
tics  discussed  in  this  paper,  maximization  of  the  throughput 
is  the  primary  objective.  This  performance  metric  is  the 
most  common  one  in  the  production  oriented  environments. 
However, the performance of the heuristics is examined using
other metrics as well.

Three  new  heuristics,  one  for  batch  and  two  for  on-line, 
are  introduced  as  part  of  this  research.  Simulation  studies 
are  performed  to  compare  these  heuristics  with  some  exist¬ 
ing  ones.  In  total,  five  on-line  heuristics  and  three  batch 
heuristics  are  examined.  The  on-line  heuristics  consider,  to 
varying  degrees  and  in  different  ways,  task  affinity  for  dif¬ 
ferent  machines  and  machine  ready  times.  The  batch  heuris¬ 
tics  consider  these  factors,  as  well  as  aging  of  tasks  waiting 
to  execute. 

Section  2  describes  some  related  work.  In  Section  3,  the 
optimization  criterion  and  another  performance  metric  are 
defined.  Section  4  discusses  the  mapping  approaches  stud¬ 
ied  here.  The  simulation  procedure  is  given  in  Section  5. 
Section  6  presents  the  simulation  results. 

This  research  is  part  of  a  DARPA/ITO  Quorum  Program 
project  called  MSHN  (Management  System  for  Heteroge¬ 
neous  Networks)  [8].  MSHN  is  a  collaborative  research 
effort  that  includes  Naval  Postgraduate  School,  NOEMIX, 
Purdue,  and  University  of  Southern  California.  It  builds  on 
SmartNet, an operational scheduling framework and system
for managing resources in an HC environment developed
at NRaD [6]. The technical objective of the MSHN
project  is  to  design,  prototype,  and  refine  a  distributed  re¬ 
source  management  system  that  leverages  the  heterogeneity 
of  resources  and  tasks  to  deliver  the  requested  qualities  of 
service.  The  heuristics  developed  here,  or  their  derivatives, 
may  be  included  in  the  Scheduling  Advisor  component  of 
the  MSHN  prototype. 

2.  Related  Work 

In  the  literature,  mapping  tasks  onto  machines  is  often 
referred  to  as  scheduling.  Several  researchers  have  worked 
on  the  dynamic  mapping  problem  from  areas  including  job 
shop  scheduling  and  distributed  computer  systems  (e.g., 
[10,12,18,20]). 

Some  of  the  heuristics  examined  for  batch-mode  map¬ 
ping  in  this  paper  are  based  on  the  static  heuristics  given  in 
[9].  The  heuristics  presented  in  [9]  are  concerned  with  map¬ 
ping  independent  tasks  onto  heterogeneous  machines  such 
that  the  completion  time  of  the  last  finishing  task  is  min¬ 
imized.  The  problem  is  recognized  as  NP-complete  and 
several  heuristics  are  designed.  Worst  case  performance 
bounds  are  obtained  for  the  heuristics.  The  Min-min  heuris¬ 
tic  that  is  used  here  as  a  benchmark  for  batch  mode  mapping 
is  based  on  the  ideas  presented  in  [9],  and  implemented  in 
SmartNet  [6]. 

In  [10],  a  dynamic  matching  and  scheduling  scheme 
based  on  a  distributed  policy  for  mapping  tasks  onto  HC 
systems  is  provided.  A  task  can  have  several  subtasks,  and 
the  subtasks  can  have  data  dependencies  among  them.  In 
the  scheme  presented  in  [10],  the  subtasks  in  an  application 
receive information about the subtasks in other applications
only in terms of load estimates on the machines. Each application
uses an algorithm that uses a weighting factor to determine
the mapping for the subtasks. The weighting factor
for  a  subtask  is  derived  by  considering  the  length  of  the  crit¬ 
ical  path  from  the  subtask  to  the  end  of  the  directed  acyclic 
graph  (DAG)  that  represents  the  application.  If  each  appli¬ 
cation  is  an  independent  task  with  no  subtasks,  as  is  the  case 
in  this  paper,  then  the  scheme  presented  in  [10]  is  not  suit¬ 
able,  because  the  mapping  criterion  is  designed  to  exploit 
information  available  in  a  DAG.  Therefore,  the  scheme  pro¬ 
vided  in  [10]  is  not  compared  to  the  heuristics  presented  in 
this  paper. 

Two  dynamic  mapping  approaches,  one  using  a  central¬ 
ized  policy  and  the  other  using  a  distributed  policy,  are  de¬ 
veloped  in  [12].  The  centralized  heuristic  referred  to  therein 
as  the  global  queue  equalization  algorithm  is  similar  to  the 
minimum  completion  time  heuristic  that  is  used  as  a  bench¬ 
mark  in  this  paper  and  described  in  Section  4.  The  heuris¬ 
tic  based  on  the  distributed  policy  uses  a  method  similar  to 
the  minimum  completion  time  heuristic  at  each  node.  The 
mapper  at  a  given  node  considers  the  local  machine  and  the 
k  highest  communication  bandwidth  neighbors  to  map  the 
tasks  in  the  local  queue.  Therefore,  the  mapper  based  on 
the  distributed  strategy  assigns  a  task  to  the  best  machine 
among the $k + 1$ machines. The simulation results provided
in  [12]  show  that  the  centralized  heuristic  always  performs 
better  than  the  distributed  heuristic.  The  heuristics  in  [12] 
are  very  similar  to  the  minimum  completion  time  heuristic 
used  as  a  benchmark  in  this  paper.  Hence,  they  are  not  ex¬ 
perimentally  compared  with  the  heuristics  presented  here. 

In  [18],  a  survey  of  dynamic  scheduling  heuristics  for 
distributed  computing  systems  is  provided.  Most  of  the 
heuristics  featured  in  [18]  perform  load  sharing  to  schedule 
the  tasks  on  different  machines,  not  considering  any  task- 
machine  affinities  while  making  the  mapping  decisions  for 
HC  systems.  In  contrast  to  [18],  these  affinities  are  con¬ 
sidered  to  varying  degrees  in  all  but  one  of  the  heuristics 
examined  in  this  paper. 

A  survey  of  dynamic  scheduling  heuristics  for  job-shop 
environments  is  provided  in  [20].  It  classifies  the  dynamic 
scheduling  algorithms  into  three  approaches:  conventional 
approach,  knowledge-based  approach,  and  distributed  prob¬ 
lem  solving  approach.  The  class  of  heuristics  grouped  under 
the  conventional  approach  are  similar  to  the  minimum  com¬ 
pletion  time  heuristic  considered  in  this  paper,  however,  the 
problem  domains  considered  in  [20]  and  here  differ.  Fur¬ 
thermore,  some  of  the  heuristics  featured  in  [20]  use  prior¬ 
ities  and  deadlines  to  determine  the  task  scheduling  order 
whereas  priorities  and  deadlines  are  not  considered  here. 

In  distributed  computer  systems,  load  balancing  schemes 
have  been  a  popular  strategy  for  mapping  tasks  onto  the  ma¬ 
chines  (e.g.,  [15,  18]).  In  [15],  the  performance  character¬ 
istics  of  simple  load  balancing  heuristics  for  HC  distributed 
systems are studied. The heuristics presented in [15] do not
consider  task  execution  times  when  making  their  decisions. 

SmartNet  [6]  is  an  RMS  for  HC  systems  that  employs 
various  heuristics  to  map  tasks  to  machines  considering  re¬ 
source  and  task  heterogeneity.  In  this  paper,  some  appro¬ 
priate  selected  SmartNet  heuristics  are  included  in  the  com¬ 
parative  study. 

3.  Performance  Metrics 

The expected execution time $e_{ij}$ of task $t_i$ on machine
$m_j$ is defined as the amount of time taken by $m_j$ to execute
$t_i$ given $m_j$ has no load when $t_i$ is assigned. The
expected completion time $c_{ij}$ of task $t_i$ on machine $m_j$ is
defined as the wall-clock time at which $m_j$ completes $t_i$ (after
having finished any previously assigned tasks). Let $m$ be the
total number of the machines in the HC suite. Let $K$ be the
set containing the tasks that will be used in a given test set
for evaluating heuristics in the study. Let the arrival time
of the task $t_i$ be $a_i$, and let the begin time of $t_i$ be $b_i$. From
the above definitions, $c_{ij} = b_i + e_{ij}$. Let $c_i$ be $c_{ij}$, where
machine $j$ is assigned to execute task $i$. The makespan for
the complete schedule is then defined as $\max_{t_i \in K}(c_i)$ [17].
Makespan is a measure of the throughput of the HC system,
and does not measure the quality of service imparted to an
individual task.
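The relationship between begin times, completion times, and makespan can be illustrated with a short sketch. It assumes tasks are processed in index order, each machine runs its tasks one at a time without preemption, and the assignment of tasks to machines is already given; the function names and the sample values are illustrative only.

```python
def completion_times(etc, arrivals, assignment):
    """Return (begin, completion) lists for tasks processed in index order.

    etc[i][j]     expected execution time e_ij of task i on machine j
    arrivals[i]   arrival time a_i
    assignment[i] machine j chosen for task i
    """
    ready = [0.0] * len(etc[0])           # when each machine next becomes free
    begin, completion = [], []
    for i, j in enumerate(assignment):
        b_i = max(arrivals[i], ready[j])  # non-preemptive: wait for earlier tasks
        c_i = b_i + etc[i][j]             # c_ij = b_i + e_ij
        ready[j] = c_i
        begin.append(b_i)
        completion.append(c_i)
    return begin, completion


def makespan(completion):
    """Makespan: the completion time of the last task to finish."""
    return max(completion)
```

For `etc = [[4, 6], [3, 2], [5, 5]]`, arrivals `[0, 1, 2]`, and assignment `[0, 1, 0]`, the third task must wait until time 4 for machine 0, so the completion times are 4, 3, and 9, and the makespan is 9.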

Recall from Section 1 that, in on-line mode, the mapper assigns
a task to a machine as soon as the task arrives at
the mapper, and in batch mode a set of independent tasks
that need to be mapped at a mapping event is called a
meta-task. (In some systems, the term meta-task is defined
in a way that allows inter-task dependencies.) In
batch mode, for the $i$-th mapping event, the meta-task $M_i$
is mapped at time $\tau_i$, where $i \geq 0$. The initial meta-task,
$M_0$, consists of all the tasks that arrived prior to time $\tau_0$,
i.e., $M_0 = \{t_j \mid a_j < \tau_0\}$. The meta-task, $M_k$, for $k > 0$, consists
of tasks that arrived after the last mapping event and
the tasks that had been mapped, but did not start executing,
i.e., $M_k = \{t_j \mid \tau_{k-1} \leq a_j < \tau_k\} \cup \{t_j \mid a_j < \tau_{k-1}, b_j > \tau_k\}$.
The waiting time for task $t_j$ is defined as $b_j - a_j$. Let $\bar{c}_j$ be
the completion time of task $t_j$ if it is the only task that is
executing on the system. The sharing penalty ($p_j$) for the task
$t_j$ is defined as $(c_j - \bar{c}_j)$. The average sharing penalty for
the tasks in the set $K$ is given by $[\sum_{t_j \in K} p_j] / |K|$. The average
sharing penalty for a set of tasks mapped by a given
heuristic is an indication of the heuristic’s ability to minimize
the effects of contention among different tasks in the
set. It therefore indicates quality of service provided to an
individual task, as gauged by the wait incurred by the task
before it begins and the time to perform the actual computation.
Other performance metrics are considered in [13].
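The meta-task and sharing penalty definitions above can be sketched in a few lines. The sketch assumes the arrival window of each meta-task is closed on the left, so a task arriving exactly at a mapping event joins the next meta-task, and it takes the solo completion times as given inputs rather than deriving them. The function names are illustrative.

```python
def meta_task(k, tau, arrival, begin):
    """Indices of the tasks in meta-task M_k.

    tau[k] is the time of the k-th mapping event, arrival[j] is a_j and
    begin[j] is b_j (use float("inf") for a task that has not started).
    """
    if k == 0:
        return {j for j, a in enumerate(arrival) if a < tau[0]}
    newly_arrived = {j for j, a in enumerate(arrival) if tau[k - 1] <= a < tau[k]}
    still_waiting = {j for j, a in enumerate(arrival)
                     if a < tau[k - 1] and begin[j] > tau[k]}
    return newly_arrived | still_waiting


def average_sharing_penalty(completion, solo_completion):
    """Mean of p_j = c_j - cbar_j over all tasks in the set K."""
    penalties = [c - s for c, s in zip(completion, solo_completion)]
    return sum(penalties) / len(penalties)
```

With mapping events at times 10, 20, and 30, a task that arrived at time 8 but does not begin until time 35 appears in all three meta-tasks, which reflects the statement that batch mode reconsiders a task at each mapping event until it begins execution.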


4.  Mapping  Heuristics 
4.1.  Overview 

In  the  on-line  mode  heuristics,  each  task  is  considered 
only  once  for  matching  and  scheduling,  i.e.,  the  mapping  is 
not  changed  once  it  is  computed.  When  the  arrival  rate  is 
low,  machines  may  be  ready  to  execute  a  task  as  soon  as  it 
arrives  at  the  mapper.  Therefore,  it  may  be  beneficial  to  use 
the  mapper  in  the  on-line  mode  so  that  a  task  need  not  wait 
until  the  next  mapping  event  to  begin  its  execution. 

In  batch  mode,  the  mapper  considers  a  meta-task  for 
matching  and  scheduling  at  each  mapping  event.  This  en¬ 
ables  the  mapping  heuristics  to  possibly  make  better  deci¬ 
sions,  because  the  heuristics  have  the  resource  requirement 
information  for  a  whole  meta-task,  and  know  about  the  ac¬ 
tual  execution  times  of  a  larger  number  of  tasks  (as  more 
tasks  might  complete  while  waiting  for  the  mapping  event). 
When  the  task  arrival  rate  is  high,  there  will  be  a  sufficient 
number  of  tasks  to  keep  the  machines  busy  in  between  the 
mapping  events,  and  while  a  mapping  is  being  computed.  It 
is,  however,  assumed  in  this  study  that  the  running  time  of 
the  heuristic  is  negligibly  small  as  compared  to  the  average 
task  execution  time. 

Both  on-line  and  batch  mode  heuristics  assume  that 
estimates  of  expected  task  execution  times  on  each  ma¬ 
chine  in  the  HC  suite  are  known.  The  assumption  that 
these  estimated  expected  times  are  known  is  commonly 
made  when  studying  mapping  heuristics  for  HC  systems 
(e.g., [7, 11, 19]). (Approaches for doing this estimation
based on task profiling and analytical benchmarking are discussed
in [14].) These estimates can be supplied before a
task  is  submitted  for  execution,  or  at  the  time  it  is  submit¬ 
ted.  (The  use  of  some  of  the  heuristics  studied  here  in  a 
static  environment  is  discussed  in  [4].) 

The  ready  time  of  a  machine  is  quantified  by  the  earliest 
time  that  machine  is  going  to  be  ready  after  completing  the 
execution  of  the  tasks  that  are  currently  assigned  to  it.  It 
is assumed that each time a task $t_i$ completes on a machine
$m_j$ a report is sent to the mapper. Because the heuristics
presented  here  are  dynamic,  the  expected  machine  ready 
times  are  based  on  a  combination  of  actual  task  execution 
times  and  estimated  expected  task  execution  times.  The  ex¬ 
periments  presented  in  Section  6  model  this  situation  us¬ 
ing  simulated  actual  values  for  the  execution  times  of  the 
tasks  that  have  already  finished  their  execution.  Also,  all 
heuristics  examined  here  operate  in  a  centralized  fashion  on 
a  dedicated  suite  of  machines;  i.e.,  the  mapper  controls  the 
execution  of  all  jobs  on  all  machines  in  the  suite.  It  is  also 
assumed  that  the  mapping  heuristic  is  being  run  on  a  sepa¬ 
rate  machine. 
4.2. On-line mode mapping heuristics

The  MCT  (minimum  completion  time)  heuristic  assigns 
each  task  to  the  machine  that  results  in  that  task’s  earliest 
completion  time.  This  causes  some  tasks  to  be  assigned  to 
machines  that  do  not  have  the  minimum  execution  time  for 
them.  The  MCT  heuristic  is  a  variant  of  the  fast-greedy 
heuristic  from  SmartNet  [6].  The  MCT  heuristic  is  used  as 
a  benchmark  for  the  on-line  mode,  i.e.,  the  performance  of 
the  other  heuristics  is  compared  against  that  of  the  MCT 
heuristic. 

As  a  task  arrives,  all  the  machines  in  the  HC  suite  are 
examined  to  determine  the  machine  that  gives  the  earliest 
completion time for the task. Therefore, it takes $O(m)$ time
to  map  a  given  task. 
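A minimal sketch of the MCT rule for a single arriving task follows. It assumes the mapper keeps a list `ready` of machine ready times and that a task cannot start before its arrival time; the function name `mct` and its parameters are illustrative.

```python
def mct(etc_row, ready, arrival=0.0):
    """Pick the machine giving the earliest completion time for one task.

    etc_row[j] is the task's expected execution time on machine j and
    ready[j] is the time at which machine j becomes free. Returns the
    chosen machine and updates ready[j] in place.
    """
    def completion(j):
        return max(ready[j], arrival) + etc_row[j]

    machine = min(range(len(ready)), key=completion)   # single O(m) scan
    ready[machine] = completion(machine)
    return machine
```

Starting from two idle machines, a task with execution times `[4, 6]` goes to machine 0. A following task with execution times `[3, 5]` then goes to machine 1, even though machine 0 would execute it faster, because machine 0 is busy until time 4.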

The  MET  (minimum  execution  time)  heuristic  assigns 
each  task  to  the  machine  that  performs  that  task’s  compu¬ 
tation  in  the  least  amount  of  execution  time  (this  heuristic  is 
also known as LBA (limited best assignment) [1] and UDA
(user  directed  assignment)  [6]).  This  heuristic,  in  contrast  to 
MCT,  does  not  consider  machine  ready  times.  This  heuristic 
can  cause  a  severe  imbalance  in  load  across  the  machines. 
The  advantages  of  this  method  are  that  it  gives  each  task 
to  the  machine  that  performs  it  in  the  least  amount  of  exe¬ 
cution  time,  and  the  heuristic  is  very  simple.  The  heuristic 
needs $O(m)$ time to find the machine that has the minimum
execution  time  for  a  task. 
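The MET rule and the load imbalance it can cause are easy to demonstrate. The sketch below assumes a small ETC matrix in which machine 0 is fastest for every task; the helper `met_loads` and the sample matrix are illustrative.

```python
def met(etc_row):
    """Pick the machine with the smallest execution time, ignoring ready times."""
    return min(range(len(etc_row)), key=lambda j: etc_row[j])   # single O(m) scan


def met_loads(etc):
    """Total execution time placed on each machine when every task uses MET."""
    loads = [0.0] * len(etc[0])
    for row in etc:
        j = met(row)
        loads[j] += row[j]
    return loads
```

For `etc = [[1, 2, 3], [2, 4, 6], [3, 5, 9]]`, every task is sent to machine 0 and the other two machines stay idle.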

The  SA  (switching  algorithm)  heuristic  is  motivated  by 
the following observations. The MET heuristic can potentially
create load imbalance across machines by assigning
many  more  tasks  to  some  machines  than  to  others,  whereas 
the  MCT  heuristic  tries  to  balance  the  load  by  assigning 
tasks  for  earliest  completion  time.  If  the  tasks  are  arriving 
in  a  random  mix,  it  is  possible  to  use  MET  at  the  expense 
of  load  balance  until  a  given  threshold  and  then  use  MCT  to 
smooth  the  load  across  the  machines.  The  SA  heuristic  uses 
the  MCT  and  MET  heuristics  in  a  cyclic  fashion  depending 
on  the  load  distribution  across  the  machines.  The  purpose  is 
to  have  a  heuristic  with  the  desirable  properties  of  both  the 
MCT  and  the  MET. 

Let the maximum ready time over all machines in the
suite be $r_{max}$, and the minimum ready time be $r_{min}$. Then,
the load balance index across the machines is given by
$\pi = r_{min}/r_{max}$. The parameter $\pi$ can have any value in the
interval [0, 1]. If $\pi$ is 1.0, then the load is evenly balanced across
the machines. If $\pi$ is 0, then at least one machine has not yet
been assigned a task. Two threshold values, $\pi_l$ (low) and $\pi_h$
(high), for the ratio $\pi$ are chosen in [0, 1] such that $\pi_l < \pi_h$.
Initially, the value of $\pi$ is set to 0.0. The SA heuristic begins
mapping tasks using the MCT heuristic until the value of the
load balance index increases to at least $\pi_h$. After that point
in time, the SA heuristic begins using the MET heuristic to
perform task mapping. This causes the load balance index
to decrease. When it reaches $\pi_l$, the SA heuristic switches
back to using the MCT heuristic for mapping the tasks and
the cycle continues.
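
The switching rule can be sketched as a small Python class. This is an illustration under stated assumptions, not the simulator used in the study: the class and function names are hypothetical, tasks are assumed to be available on arrival, and the load balance index is taken to be 0.0 while all machines are still idle.

```python
def load_balance_index(ready):
    """pi = r_min / r_max; taken as 0.0 while every machine is still idle."""
    r_max = max(ready)
    return min(ready) / r_max if r_max > 0 else 0.0


class SwitchingMapper:
    def __init__(self, num_machines, low=0.6, high=0.9):
        self.ready = [0.0] * num_machines
        self.low, self.high = low, high
        self.use_mct = True  # pi starts at 0.0, so SA starts in MCT mode

    def assign(self, etc_row):
        pi = load_balance_index(self.ready)
        if self.use_mct and pi >= self.high:
            self.use_mct = False   # balanced enough: switch to MET
        elif not self.use_mct and pi <= self.low:
            self.use_mct = True    # too unbalanced: switch back to MCT
        machines = range(len(self.ready))
        if self.use_mct:
            j = min(machines, key=lambda x: self.ready[x] + etc_row[x])
        else:
            j = min(machines, key=lambda x: etc_row[x])
        self.ready[j] += etc_row[j]
        return j


sa = SwitchingMapper(2)
picks = [sa.assign(row) for row in
         ([1.0, 1.0], [1.0, 1.0], [1.0, 5.0], [1.0, 5.0])]
```

In this run the third task is mapped with MET (the index has reached 1.0) and the fourth with MCT again (the index has dropped to 0.5).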

As an example of the functioning of the SA with lower and
upper limits of 0.6 and 0.9, respectively, for $|\mathcal{K}| = 1000$, the
SA switched between the MET and the MCT two times, assigning
715 tasks using the MCT. For $|\mathcal{K}| = 2000$, the SA
switched five times, using the MCT to assign 1080 tasks.
The percentage of tasks assigned using MCT gets progressively
smaller for larger $|\mathcal{K}|$. This is because an MET assignment
in a highly loaded system will bring a smaller decrease
in load balance index than when the same assignment
is made in a lightly loaded system. Therefore many more
MET assignments can be made in a highly loaded system
before the load balance index falls below the lower threshold.

At each task’s arrival, the SA heuristic determines the
load balance index. In the worst case, this takes $O(m)$ time.
In the next step, the time taken to assign a task to a machine
is of order $O(m)$ whether SA uses the MET to perform
the mapping or the MCT. Overall, the SA heuristic
takes $O(m)$ time irrespective of which heuristic is actually
used for mapping the task.

The KPB (k-percent best) heuristic considers only a subset
of machines while mapping a task. The subset is formed
by picking the $(km/100)$ best machines based on the execution
times for the task, where $100/m \le k \le 100$. The task
is assigned to a machine that provides the earliest completion
time in the subset. If $k = 100$, then the KPB heuristic is
reduced to the MCT heuristic. If $k = 100/m$, then the KPB
heuristic is reduced to the MET heuristic. A “good” value of
$k$ maps a task to a machine only within a subset formed from
computationally superior machines. The purpose is not so
much to match the current task to a computationally
well-matched machine as it is to avoid putting the current
task onto a machine which might be more suitable for some
yet-to-arrive tasks. This “foresight” about task heterogeneity
is lacking in the MCT, which might assign a task to a poorly
matched machine for an immediate marginal improvement
in completion time, possibly depriving some subsequently
arriving tasks of that machine, and eventually leading to
a larger makespan as compared to the KPB. It should be
noted that while both the KPB and SA have elements of the
MCT and the MET in their operation, it is only in the KPB
that each task assignment attempts to optimize objectives of
the MCT and the MET simultaneously. However, in cases
where a fixed subset of machines is not among the $k$% best
for any task, the KPB will cause much machine idle time
compared to the MCT, and can result in much poorer performance.

For each task, $O(m \log m)$ time is spent in ranking the
machines for determining the subset of machines to examine.
Once the subset of machines is determined, it takes
$O(km/100)$ time, i.e., $O(m)$ time to determine the machine
assignment. Overall the heuristic takes $O(m \log m)$ time.
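
A hedged Python sketch of KPB is given below. The name `kpb_assign` is hypothetical, and rounding the subset size $km/100$ down (with a minimum of one machine) is an assumption, since the text does not say how fractional sizes are handled.

```python
import math


def kpb_assign(etc_row, ready, k):
    """k-percent best: MCT restricted to the km/100 fastest machines."""
    m = len(etc_row)
    # Assumption: round the subset size down, but keep at least one machine.
    subset_size = max(1, math.floor(k * m / 100))
    # O(m log m) ranking of the machines by execution time for this task.
    ranked = sorted(range(m), key=lambda j: etc_row[j])
    subset = ranked[:subset_size]
    # Earliest completion time within the subset only.
    return min(subset, key=lambda j: ready[j] + etc_row[j])


etc_row = [4.0, 1.0, 2.0, 9.0]
ready = [0.0, 10.0, 3.0, 0.0]
as_mct = kpb_assign(etc_row, ready, 100)  # all four machines considered
as_met = kpb_assign(etc_row, ready, 25)   # only the single fastest machine
middle = kpb_assign(etc_row, ready, 50)   # the two fastest machines
```

With k = 50 the task goes to machine 2: it is not the fastest machine and does not give the overall earliest completion, but it is the best compromise inside the subset.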

The OLB (opportunistic load balancing) heuristic assigns
the task to the machine that becomes ready next. It
does not consider the execution time of the task when mapping
it onto a machine. If multiple machines become ready
at the same time, then one machine is arbitrarily chosen.

The complexity of the OLB heuristic is dependent on the
implementation. In the implementation considered here, the
mapper may need to examine all $m$ machines to find the
machine that becomes ready next. Therefore, it takes $O(m)$
time to find the assignment. Other implementations may require
idle machines to assign tasks to themselves by accessing a
shared global queue of tasks [21].
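
For comparison with the other on-line heuristics, OLB can be sketched as follows. The name `olb_assign` and the numbers are illustrative assumptions; ties are broken by lowest index here, which is one possible arbitrary choice.

```python
def olb_assign(ready):
    """Pick the machine that becomes ready next, ignoring execution times."""
    return min(range(len(ready)), key=lambda j: ready[j])


ready = [4.0, 1.0, 3.0]
etc_row = [1.0, 50.0, 2.0]
j = olb_assign(ready)
ready[j] += etc_row[j]  # machine 1 is chosen although it is by far the slowest
```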

4.3.  Batch  mode  mapping  heuristics 

In the batch mode heuristics, meta-tasks are mapped after
predefined intervals. These intervals are defined in this
study using one of the two strategies proposed below.

The  regular  time  interval  strategy  maps  the  meta-tasks  at 
regular  intervals  of  time  except  when  all  machines  are  busy. 
When  all  machines  are  busy,  all  scheduled  mapping  events 
that  precede  the  one  before  the  expected  ready  time  of  the 
machine  that  finishes  earliest  are  canceled. 

The fixed count strategy maps a meta-task $M_i$ as soon as
one of the following two mutually exclusive conditions is
met: (a) an arriving task makes $|M_i|$ larger than or equal to
a predetermined arbitrary number $K$, or (b) all tasks have arrived,
and a task completes while the number of tasks which
have yet to begin is larger than or equal to $K$. In this strategy,
the length of the mapping intervals will depend on the
arrival rate and the completion rate. The possibility of machines
being idle while waiting for the next mapping event
will depend on the arrival rate, completion rate, $m$, and $K$.

The batch mode heuristics considered in this study are
discussed in the paragraphs below. The complexity analysis
performed for these heuristics considers a single mapping
event. In the complexity analysis, the meta-task
size is assumed to be equal to the average of meta-task
sizes at all actually performed mapping events. Let the
average meta-task size be $S$.

The Min-min heuristic shown in Figure 1 is from SmartNet
[6]. In Figure 1, let $r_j$ denote the expected time machine
$m_j$ will become ready to execute a task after finishing
the execution of all tasks assigned to it at that point in time.
First the $c_{ij}$ entries are computed using the $e_{ij}$ and $r_j$ values.
For each task $t_i$ the machine that gives the earliest expected
completion time is determined by scanning the rows of the
$c$ matrix. The task $t_k$ that has the minimum earliest expected
completion time is determined and then assigned to the corresponding
machine. The matrix $c$ and vector $r$ are updated
and the above process is repeated with tasks that have not
yet been assigned a machine.


Min-min begins by scheduling the tasks that change the
expected machine ready time status by the least amount that
any assignment could. If tasks $t_i$ and $t_k$ are contending for
a particular machine $m_j$, then Min-min assigns $m_j$ to the
task (say $t_i$) that will change the ready time of $m_j$ less. This
increases the probability that $t_k$ will still have its earliest
completion time on $m_j$, and shall be assigned to it. Because
at $t = 0$, the machine which finishes a task earliest
is also the one that executes it fastest, and from thereon
the Min-min heuristic changes machine ready time status by the
least amount for every assignment, the percentage of tasks
assigned their first choice (on the basis of expected execution
time) is likely to be higher in Min-min than with the other
batch mode heuristics described in this section. The expectation
is that a smaller makespan can be obtained if a larger
number of tasks is assigned to the machines that not only
complete them earliest but also execute them fastest.

(1)  for all tasks t_i in meta-task M_v (in an arbitrary order)
(2)      for all machines m_j (in a fixed arbitrary order)
(3)          c_ij = e_ij + r_j
(4)  do until all tasks in M_v are mapped
(5)      for each task in M_v find the earliest completion
             time and the machine that obtains it
(6)      find the task t_k with the minimum earliest
             completion time
(7)      assign task t_k to the machine m_l that gives the
(8)          earliest completion time
(9)      delete task t_k from M_v
(10)     update r_l
(11)     update c_il for all i
(12) enddo

Figure  1.  The  Min-min  heuristic. 

The initialization of the $c$ matrix in Line (1) to Line (3)
takes $O(Sm)$ time. The do loop of the Min-min heuristic
is repeated $S$ times and each iteration takes $O(Sm)$ time.
Therefore, the heuristic takes $O(S^2 m)$ time.
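
The pseudocode of Figure 1 can be turned into a short runnable Python sketch. The names `min_min`, `etc`, and `ready` are illustrative assumptions, and the completion times are recomputed from the ready times in every iteration instead of being stored in a separate $c$ matrix, which gives the same $O(S^2 m)$ behavior.

```python
def min_min(etc, ready):
    """Map every task of a meta-task; return ({task: machine}, ready times).

    etc[i][j]: expected execution time of task i on machine j.
    ready[j]:  ready time of machine j before this mapping event.
    """
    ready = list(ready)
    unmapped = set(range(len(etc)))
    mapping = {}
    while unmapped:                       # repeated S times
        best = None
        for i in sorted(unmapped):        # O(S m) work per iteration
            j = min(range(len(ready)), key=lambda x: ready[x] + etc[i][x])
            c = ready[j] + etc[i][j]      # earliest completion time of task i
            if best is None or c < best[0]:
                best = (c, i, j)
        c, i, j = best                    # task with the minimum of those
        mapping[i] = j
        ready[j] = c                      # update the ready time of machine j
        unmapped.remove(i)
    return mapping, ready


etc = [[3, 5],
       [2, 4],
       [6, 1]]
mapping, final_ready = min_min(etc, [0, 0])
```

Here task 2 is mapped first (completion time 1 on machine 1), then task 1, then task 0, giving a makespan of 5.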

The Max-min heuristic is similar to the Min-min heuristic
given in Figure 1. It is also from SmartNet [6]. Once the
machine that provides the earliest completion time is found
for every task, the task $t_k$ that has the maximum earliest
completion time is determined and then assigned to the corresponding
machine. The matrix $c$ and vector $r$ are updated
and the above process is repeated with tasks that have not
yet been assigned a machine. The Max-min heuristic has
the same complexity as the Min-min heuristic.

The Max-min is likely to do better than the Min-min
heuristic in the cases where there are many more short
tasks than long tasks. For example, if there is only one
long task, Max-min will execute many short tasks concurrently
with the long task. The resulting makespan might
just be determined by the execution time of the long task
in these cases. Min-min, however, first finishes the shorter
tasks (which may be more or less evenly distributed over
the machines) and then executes the long task, increasing
the makespan.
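
The effect described above can be reproduced with a small Python sketch in which the only difference between the two heuristics is the selection function. The function name `batch_map` and the task set (four short tasks and one long task on two identical machines) are made-up illustrations, not data from the study.

```python
def batch_map(etc, ready, select):
    """Min-min when select is min, Max-min when select is max.

    Returns the makespan of the resulting mapping.
    """
    ready = list(ready)
    unmapped = set(range(len(etc)))
    while unmapped:
        # Earliest completion time (and its machine) for every unmapped task.
        candidates = [
            min((ready[j] + etc[i][j], i, j) for j in range(len(ready)))
            for i in unmapped
        ]
        c, i, j = select(candidates)  # smallest or largest of those minima
        ready[j] = c
        unmapped.remove(i)
    return max(ready)


etc = [[1, 1]] * 4 + [[4, 4]]   # four short tasks, one long task
makespan_min_min = batch_map(etc, [0, 0], min)
makespan_max_min = batch_map(etc, [0, 0], max)
```

Min-min spreads the short tasks over both machines and runs the long task last, for a makespan of 6. Max-min starts the long task first and packs the short tasks onto the other machine, for a makespan of 4.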

The Sufferage heuristic is based on the idea that better
mappings can be generated by assigning a machine to a task
that would “suffer” most in terms of expected completion
time if that particular machine is not assigned to it. Let
the sufferage value of a task $t_i$ be the difference between its
second earliest completion time (on some machine $m_y$) and
its earliest completion time (on some machine $m_x$). That is,
using $m_x$ will result in the best completion time for $t_i$, and
using $m_y$ the second best.

Figure 2 shows the Sufferage heuristic. The initialization
phase in Lines (1) to (3) is similar to the ones in the Min-min
and Max-min heuristics. Initially all machines are marked
unassigned. In each iteration of the for loop in Lines (6) to
(14), pick arbitrarily a task $t_k$ from the meta-task. Find the
machine $m_j$ that gives the earliest completion time for task
$t_k$, and tentatively assign $m_j$ to $t_k$ if $m_j$ is unassigned. Mark
$m_j$ as assigned, and remove $t_k$ from the meta-task. If, however,
machine $m_j$ has been previously assigned to a task $t_i$,
choose from $t_i$ and $t_k$ the task that has the higher sufferage
value, assign $m_j$ to the chosen task, and remove the chosen
task from the meta-task. The unchosen task will not be
considered again for this execution of the for statement, but
shall be considered for the next iteration of the do loop beginning
on Line (4). When all the iterations of the for loop
are completed (i.e., when one execution of the for statement
is completed), update the machine ready time of each
machine assigned a new task. Perform the next iteration of
the do loop beginning on Line (4) until all tasks have been
mapped.

Table 1 shows a scenario in which the Sufferage will
outperform the Min-min. Table 1 shows the expected execution
time values for four tasks on four machines (all initially
idle). In this particular case, the Min-min heuristic
gives a makespan of 9.3 and the Sufferage heuristic gives a
makespan of 7.8. Figure 3 gives a pictorial representation
of the assignments made for the case in Table 1.

Table 1. An example expected execution time
matrix that illustrates the situation where the
Sufferage heuristic outperforms the Min-min
heuristic.

From the pseudo code given in Figure 2, it can be observed
that the first execution of the for statement on Line (6)
takes $O(Sm)$ time. The number of task assignments made
in one execution of this for statement depends on the total
number of machines in the HC suite, the number of machines
that are being contended for among different tasks,
and the number of tasks in the meta-task being mapped. In
the worst case, only one task assignment will be made in
each execution of the for statement. Then the meta-task size
will decrease by one at each for statement execution. The
outer do loop will be iterated $S$ times to map the whole
meta-task. Therefore, in the worst case, the time $T(S)$ taken
to map a meta-task of size $S$ will be

$T(S) = Sm + (S-1)m + (S-2)m + \cdots + m$
$T(S) = O(S^2 m)$

In the best case, there are as many machines as there are
tasks in the meta-task, and there is no contention among the
tasks. Then all the tasks are assigned in the first execution of
the for statement so that $T(S) = O(Sm)$. Let $\omega$ be a number
quantifying the extent of contention among the tasks for the
different machines. The running time of the Sufferage heuristic
can then be given as $O(\omega S m)$ time, where $1 \le \omega \le S$. It can
be seen that $\omega$ is equal to $S$ in the worst case, and is 1 in
the best case; these values of $\omega$ are numerically equal to the
number of iterations of the do loop on Line (4).
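
As a short worked step, the worst-case sum is an arithmetic series, so the quadratic bound follows directly:

```latex
T(S) = m \sum_{s=1}^{S} s = \frac{m\,S(S+1)}{2} = O(S^{2} m)
```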

(1)  for all tasks t_k in meta-task M_v (in an arbitrary order)
(2)      for all machines m_j (in a fixed arbitrary order)
(3)          c_kj = e_kj + r_j
(4)  do until all tasks in M_v are mapped
(5)      mark all machines as unassigned
(6)      for each task t_k in M_v (in an arbitrary order)
(7)          find machine m_j that gives the earliest
                 completion time
(8)          sufferage value = second earliest completion
                 time - earliest completion time
(9)          if machine m_j is unassigned
(10)             assign t_k to machine m_j, delete t_k
                     from M_v, mark m_j assigned
(11)         else
(12)             if sufferage value of task t_i already
                     assigned to m_j is less than the
                     sufferage value of task t_k
(13)                 unassign t_i, add t_i back to M_v,
                         assign t_k to machine m_j,
                         delete t_k from M_v
(14)     endfor
(15)     update the vector r based on the tasks that
             were assigned to the machines
(16)     update the c matrix
(17) enddo

Figure  2.  The  Sufferage  heuristic. 
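
The pseudocode of Figure 2 can be sketched in Python as follows. The function name `sufferage` and the two-task example are illustrative assumptions (the values of Table 1 are not reproduced here). A displaced task simply stays in the unmapped list and is reconsidered in the next pass of the outer loop.

```python
def sufferage(etc, ready):
    """Return ({task: machine}, makespan) for one meta-task."""
    ready = list(ready)
    m = len(ready)
    unmapped = list(range(len(etc)))
    mapping = {}
    while unmapped:                          # do loop, Line (4)
        holder = {}                          # machine -> (task, sufferage value)
        for k in list(unmapped):             # for loop, Line (6)
            c = sorted((ready[j] + etc[k][j], j) for j in range(m))
            best_c, best_j = c[0]
            suff = c[1][0] - best_c if m > 1 else 0.0
            if best_j not in holder:
                holder[best_j] = (k, suff)   # machine was unassigned
            elif holder[best_j][1] < suff:
                holder[best_j] = (k, suff)   # displace the earlier task
        for j, (k, _) in holder.items():     # Lines (15) and (16)
            ready[j] += etc[k][j]
            mapping[k] = j
            unmapped.remove(k)
    return mapping, max(ready)


# Both tasks prefer machine 0, but task 0 would suffer far more without it.
etc = [[2, 10],
       [1, 2]]
mapping, makespan = sufferage(etc, [0, 0])
```

In this made-up example Sufferage gives machine 0 to task 0 and moves task 1 to machine 1, for a makespan of 2. Min-min would map task 1 to machine 0 first and finish at time 3.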

The batch mode heuristics can cause some tasks to be
starved of machines. Let $H_i$ be a subset of meta-task $M_i$
consisting of tasks that were mapped (as part of $M_i$) at the
mapping event $i$ at time $\tau_i$ but did not begin execution by
the next mapping event at $\tau_{i+1}$. $H_i$ is the subset of $M_i$ that
is included in $M_{i+1}$. Due to the expected heterogeneous nature
of the tasks, the meta-task $M_{i+1}$ may be so mapped that
some or all of the tasks arriving between $\tau_i$ and $\tau_{i+1}$ may
begin executing before the tasks in set $H_i$ do. It is possible
that some or all of the tasks in $H_i$ may be included in $H_{i+1}$.
This probability increases as the number of new tasks arriving
between $\tau_i$ and $\tau_{i+1}$ increases. In general, some tasks
may be remapped at each successive mapping event without
actually beginning execution (i.e., the task is starving for a
machine).

[Figure 3: two bar charts of the task-to-machine assignments, one
using Min-min and one using Sufferage; bar heights are proportional
to task execution times.]

Figure 3. An example scenario (based on Table 1)
where the Sufferage gives a shorter
makespan than the Min-min.

To reduce starvation, aging schemes are implemented.
The age of a task is set to zero when it is mapped for the
first time and incremented by one each time the task is
remapped. Let $a$ be a constant that can be adjusted empirically
to change the extent to which aging affects the operation
of the heuristic. An aging factor, $\zeta = (1 + age/a)$,
is then computed for each task. For the experiments in this
study, $a$ is set to 10. The aging factor is used to enhance
the probability of an “older” task beginning before the tasks
that would otherwise begin first. In the Min-min heuristic,
for each task, the completion time obtained in Line (5) of
Figure 1 is multiplied by the corresponding value for $1/\zeta$. As
the age of a task increases, its age-compensated expected
completion time (i.e., the one used to determine the mapping)
gets increasingly smaller than its original expected completion
time. This increases its probability of being selected in
Line (6) in Figure 1.

Similarly, for the Max-min heuristic, the completion
time of a task is multiplied by $\zeta$. In the Sufferage heuristic,
the sufferage value computed in Line (8) in Figure 2 is
multiplied by $\zeta$.
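
A small Python sketch of the aging factor follows. The function names are hypothetical, the factor is written as 1 + age/a with a = 10 as in the experiments, and the Min-min key is assumed to be the completion time divided by the aging factor, which is what the statement that the age-compensated time gets smaller implies.

```python
def aging_factor(age, a=10):
    """1 + age / a, where age counts how often the task was remapped."""
    return 1 + age / a


def min_min_key(completion_time, age, a=10):
    """Age-compensated completion time used when Min-min picks a task.

    Assumption: the completion time is divided by the aging factor,
    so older tasks look earlier than they really are.
    """
    return completion_time / aging_factor(age, a)


old_task = min_min_key(12.0, age=5)   # remapped five times already
new_task = min_min_key(9.0, age=0)    # mapped for the first time
```

Although the old task has the later expected completion time (12 versus 9), its compensated value of 8 is smaller, so Min-min would select it first.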

5.  Simulation  Procedure 

The mappings are simulated using a discrete event simulator.
The task arrivals are modeled by a Poisson random
process. The simulator contains an ETC (expected time to
compute) matrix that contains the expected execution times
of a task on all machines, for all the tasks that can arrive for
service. The ETC matrix entries used in the simulation studies
represent the $e_{ij}$ values that the heuristic would use in its
operation. The actual execution time of a task can be different
from the value given by the ETC matrix. This variation
is modeled by generating a simulated actual execution time
for each task by sampling a truncated Gaussian probability
density function with variance equal to three times the
expected execution time of the task and mean equal to the
expected execution time of the task [2, 16]. If the sampling
results in a negative value, the value is discarded and the
same probability density function is sampled again. This
process is repeated until a positive value is returned by the
sampling process.
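
The resampling procedure can be sketched with Python's standard `random` module. The function name `simulated_execution_time` is an assumption; note that `random.gauss` takes a standard deviation, so the square root of the stated variance is passed.

```python
import random


def simulated_execution_time(expected, rng=random):
    """Sample a Gaussian with mean `expected` and variance 3 * `expected`,
    discarding non-positive samples and sampling again."""
    sigma = (3 * expected) ** 0.5   # standard deviation from the variance
    while True:
        value = rng.gauss(expected, sigma)
        if value > 0:
            return value


rng = random.Random(1)   # fixed seed only to make the example repeatable
samples = [simulated_execution_time(4.0, rng) for _ in range(1000)]
```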

In an ETC matrix, the numbers along a row indicate
the execution times of the corresponding task on different
machines. The average variation along the rows is referred
to as the machine heterogeneity [2]. Similarly, the
average variation along the columns is referred to as the
task heterogeneity [2]. One classification of heterogeneity
is to divide it into high heterogeneity and low heterogeneity.
Based on the above idea, four categories were proposed
for the ETC matrix in [2]: (a) high task heterogeneity and
high machine heterogeneity (HiHi), (b) high task heterogeneity
and low machine heterogeneity (HiLo), (c) low task
heterogeneity and high machine heterogeneity (LoHi), and
(d) low task heterogeneity and low machine heterogeneity
(LoLo). The ETC matrix can be further classified into two
classes, consistent and inconsistent, which are orthogonal
to the previous classifications. For a consistent ETC matrix,
if machine $m_x$ has a lower execution time than machine
$m_y$ for task $t_k$, then the same is true for any task $t_i$.
The ETC matrices that are not consistent are inconsistent
ETC matrices. In addition to the consistent and inconsistent
classes, a semi-consistent class could also be defined.
A semi-consistent ETC matrix is characterized by a consistent
sub-matrix. In the semi-consistent ETC matrices used
here, 50% of the tasks and 25% of the machines define a
consistent sub-matrix. Furthermore, it is assumed that for a
particular task the execution times that fall within the consistent
sub-matrix are smaller than those that fall outside it. This
assumption is justified because the machines that perform
consistently better than the others for some tasks are more
likely to be very much faster for those tasks than very much
slower.

Let an ETC matrix have $t_{max}$ rows and $m_{max}$ columns.
Random ETC matrices that belong to the different categories
are generated in the following manner:

1. Let $\Gamma_t$ be an arbitrary constant quantifying task heterogeneity,
being smaller for low task heterogeneity. Let
$N_t$ be a number picked from the uniform random distribution
with range $[1, \Gamma_t]$.

2. Let $\Gamma_m$ be an arbitrary constant quantifying machine
heterogeneity, being smaller for low machine heterogeneity.
Let $N_m$ be a number picked from the uniform
random distribution with range $[1, \Gamma_m]$.

3. Sample $N_t$ $t_{max}$ times to get a vector $q[0..(t_{max}-1)]$.

4. Generate the ETC matrix, $e[0..(t_{max}-1), 0..(m_{max}-1)]$,
by the following algorithm:

for t_i from 0 to (t_max - 1)
    for m_j from 0 to (m_max - 1)
        pick a new value for N_m
        e[i,j] = q[i] * N_m
    endfor
endfor
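
The generation steps above can be sketched in Python. The function name `generate_etc` is hypothetical, each entry is assumed to be the row value q[i] multiplied by a freshly drawn machine-heterogeneity sample, and both uniform distributions are assumed to be continuous, which the text does not specify.

```python
import random


def generate_etc(t_max, m_max, gamma_t, gamma_m, rng=random):
    """Raw (inconsistent) ETC matrix with t_max rows and m_max columns."""
    # Step 3: one task-heterogeneity sample per task (row).
    q = [rng.uniform(1, gamma_t) for _ in range(t_max)]
    # Step 4: a new machine-heterogeneity sample for every entry.
    return [[q[i] * rng.uniform(1, gamma_m) for _ in range(m_max)]
            for i in range(t_max)]


rng = random.Random(0)   # fixed seed only to make the example repeatable
etc = generate_etc(5, 4, gamma_t=3000, gamma_m=100, rng=rng)
```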

From the raw ETC matrix generated above, a semi-consistent
matrix could be generated by sorting the execution
times for a random subset of the tasks on a random
subset of machines. An inconsistent ETC matrix could be
obtained simply by leaving the raw ETC matrix as such.
Consistent ETC matrices were not considered in this study
because they are least likely to arise in the current intended
MSHN environment.
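
One possible way to carry out the sorting step is sketched below. The function name and the default fractions (50% of the tasks, 25% of the machines, taken from the description of the semi-consistent matrices used here) are illustrative, and the additional assumption that entries inside the sub-matrix are smaller than those outside it is not enforced by this sketch.

```python
import random


def make_semi_consistent(etc, task_frac=0.5, machine_frac=0.25, rng=random):
    """Sort, in place, the entries of a random task subset over a random
    machine subset, so that those machines are ranked identically for
    every chosen task. Returns the chosen tasks and machines."""
    tasks = rng.sample(range(len(etc)), int(task_frac * len(etc)))
    machines = sorted(
        rng.sample(range(len(etc[0])), int(machine_frac * len(etc[0]))))
    for i in tasks:
        values = sorted(etc[i][j] for j in machines)
        for j, v in zip(machines, values):
            etc[i][j] = v   # lower-indexed chosen machine is never slower
    return tasks, machines


rng = random.Random(0)
etc = [[rng.uniform(1, 100) for _ in range(8)] for _ in range(4)]
tasks, machines = make_semi_consistent(etc, rng=rng)
```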

In the experiments described here, the values of $\Gamma_t$ for
low and high task heterogeneities are 1000 and 3000, respectively.
The values of $\Gamma_m$ for low and high machine heterogeneities
are 10 and 100, respectively. These heterogeneous
ranges are based on one type of expected environment
for MSHN.

6.  Experimental  Results  and  Discussion 
6.1.  Overview 

The experimental evaluation of the heuristics is performed
in three parts. In the first part, the on-line mode
heuristics are compared using various metrics. The second
part involves a comparison of the batch mode heuristics.
The third part is the comparison of the batch mode and
the on-line mode heuristics. Unless stated otherwise, the
following are valid for the experiments described here. The
number of machines is held constant at 20, and the experiments
are performed for $|\mathcal{K}| = \{1000, 2000\}$. All heuristics
are evaluated in a HiHi heterogeneity environment, both
for the inconsistent and the semi-consistent cases, because
these correspond to some of the currently expected MSHN
environments. A Poisson distribution is used to generate the
task arrivals. For each value of $|\mathcal{K}|$, tasks are mapped under
two different arrival rates, $\lambda_h$ and $\lambda_l$, such that $\lambda_h > \lambda_l$. The
value of $\lambda_h$ is chosen empirically to be high enough to allow
at most 50% of the tasks to complete when the last task in the set
arrives. Similarly, $\lambda_l$ is chosen to be low enough to allow at
least 90% of the tasks to complete when the last task in the
set arrives. The MCT heuristic is used in this standardization.
Unless otherwise stated, the task arrival rate is set to
$\lambda_h$. $\lambda_l$ is more likely to represent an HC system where the
task arrival is characterized by little burstiness; no particular
group of tasks arrives in a much shorter span of time than
some other group having the same number of tasks. $\lambda_h$ is supposed
to characterize the arrivals in an HC system where
a large group of tasks arrives in a much shorter time than
some other group having the same number of tasks; e.g., in this
case a burst of $|\mathcal{K}|$ tasks.
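
Task arrival times for a Poisson process can be generated from exponentially distributed inter-arrival gaps. The sketch below uses Python's standard `random.expovariate`; the function name and the rate value are illustrative assumptions, not the rates used in the experiments.

```python
import random


def poisson_arrivals(num_tasks, rate, rng=random):
    """Arrival times of a Poisson process with the given rate:
    successive gaps are exponential with mean 1 / rate."""
    t, times = 0.0, []
    for _ in range(num_tasks):
        t += rng.expovariate(rate)
        times.append(t)
    return times


rng = random.Random(0)   # fixed seed only to make the example repeatable
arrivals = poisson_arrivals(1000, rate=2.0, rng=rng)
```

A higher rate compresses the same number of arrivals into a shorter span of time, which is the role played by the high arrival rate in the experiments.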

Example comparisons are discussed in Subsections 6.2
to 6.4. Each data point in the comparison charts is an average
over 50 trials, where for each trial the simulated actual
task execution times are chosen independently. More general
conclusions about the heuristics’ performance are in Section 7.
Comparisons for a larger set of performance metrics
are given in [13].

6.2.  Comparisons  of  the  on-line  mode  heuristics 

Unless otherwise stated, the on-line mode heuristics are
investigated under the following conditions. In the KPB
heuristic, $k$ is equal to 20%. This particular value of $k$ was
found to give the lowest makespan for the KPB heuristic
under the conditions of the experiments. For the SA, the
lower threshold and the upper threshold for the load balance
index are 0.6 and 0.9, respectively. Once again these values
were found to give optimum values of makespan for the SA.

In  Figure  4,  on-line  mode  heuristics  are  compared  based 
on  makespan  for  inconsistent  HiHi  heterogeneity.  From 
Figure  4,  it  can  be  noted  that  the  KPB  heuristic  completes 
the  execution  of  the  last  finishing  task  earlier  than  the  other 
heuristics  (however,  it  is  only  slightly  better  than  the  MCT). 
For  k  =  20%  and  m  =  20,  the  KPB  heuristic  forces  a  task 
to  choose  a  machine  from  a  subset  of  four  machines.  These 
four  machines  have  the  lowest  execution  times  for  the  given 
task. The chosen machine would give the smallest completion
time as compared to other machines in the set.

Figure 5 compares the on-line mode heuristics using average
sharing penalty. Once again, the KPB heuristic performs
best. However, the margin of improvement is smaller
than that for the makespan. It is evident that the KPB provides
maximum throughput (system oriented performance
metric) and minimum average sharing penalty (application
oriented performance metric).

Figure 4. Makespan for the on-line heuristics for
inconsistent HiHi heterogeneity.

Figure 5. Average sharing penalty of the on-line
heuristics for inconsistent HiHi heterogeneity.

Figure 6 compares the makespans of the different on-line
heuristics for semi-consistent HiHi ETC matrices. Figure 7
compares the average sharing penalties of the different on-line
heuristics. As shown in Figures 4 and 6, the relative
performance of the different on-line heuristics is impacted
by the degree of consistency of the ETC matrices.

For the semi-consistent type of heterogeneity, machines
within a particular subset perform tasks that lie within a particular
subset faster than other machines. From Figure 6, it
can be observed that for semi-consistent ETC matrices, the
MET heuristic performs the worst. For the semi-consistent
matrices used in these simulations, the MET heuristic maps
half of the tasks to the same machine, considerably increasing
the load imbalance. Although the KPB also considers
only the fastest four machines for each task for the particular
value of $k$ used here (which happen to be the same four
machines for half of the tasks), the performance does not
differ much from the inconsistent HiHi case. Additional experiments
have shown that the KPB performance is quite
insensitive to values of $k$ as long as $k$ is larger than the minimum
value (where the KPB heuristic is reduced to the MET
heuristic). For example, when $k$ is doubled from its minimum
value of 5, the makespan decreases by a factor of
about 5. However, a further doubling of $k$ brings down the
makespan by a factor of only about 1.2.

Figure 6. Makespan of the on-line heuristics for
semi-consistent HiHi heterogeneity.


6.3. Comparisons of the batch mode heuristics

Figures 8 and 9 compare the batch mode heuristics based on makespan
and average sharing penalty, respectively. In these comparisons,
unless otherwise stated, the regular time interval strategy is
employed to schedule meta-task mapping events. The time interval is
set to 10 seconds. This value was empirically found to optimize
makespan over other values. From Figure 8, it can be noted that the
Sufferage heuristic outperforms the Min-min and the Max-min heuristics
based on makespan (although it is only slightly better than the
Min-min). However, for average sharing penalty, the Min-min heuristic
outperforms the other heuristics (Figure 9). In making the mapping
decisions, the Sufferage heuristic considers the “loss” in completion
time of a task if it is not assigned to its first choice. By assigning
their first choice machines to the tasks that have the highest
sufferage values among all contending tasks, the Sufferage heuristic
reduces the overall completion time.
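
The contention rule described above can be sketched as follows. The
sketch assumes that a task's sufferage value is the difference between
its second-earliest and earliest completion times, and that mapping
proceeds in passes in which each machine accepts at most one task. The
name `sufferage_map` and the data layout are illustrative assumptions;
at least two machines are required.

```python
def sufferage_map(etc, ready):
    """Map a batch of tasks in a Sufferage style (a sketch).

    etc[t][j]: expected execution time of task t on machine j.
    ready[j]:  initial ready time of machine j.
    Returns (mapping, makespan), where mapping[t] is the chosen machine.
    """
    ready = list(ready)
    unmapped = set(range(len(etc)))
    mapping = {}
    while unmapped:
        winners = {}  # machine -> (sufferage value, task) for this pass
        for t in sorted(unmapped):
            # Completion times of task t on every machine, best first.
            ct = sorted((ready[j] + etc[t][j], j) for j in range(len(ready)))
            (best_ct, best_j), (second_ct, _) = ct[0], ct[1]
            suff = second_ct - best_ct
            # The contending task that would "suffer" most keeps the machine.
            if best_j not in winners or suff > winners[best_j][0]:
                winners[best_j] = (suff, t)
        for j, (_, t) in winners.items():
            mapping[t] = j
            ready[j] += etc[t][j]
            unmapped.remove(t)
    return mapping, max(ready)
```

For example, with `etc = [[2, 10], [3, 4]]` and both machines idle,
both tasks prefer machine 0, but task 0 would lose 8 time units on its
second choice while task 1 would lose only 1, so task 0 keeps machine 0
and task 1 is remapped in the next pass.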
Figure 7. Average sharing penalty of the on-line

Figure 8. Makespan of the batch heuristics for the regular time
interval strategy and inconsistent HiHi heterogeneity.

Figure 9. Average sharing penalty of the batch

Furthermore, it can be noted that the makespan given by the Max-min is
much larger than the makespans obtained by the other two heuristics.
Using reasoning similar to that given in Subsection 4.3 for explaining
better expected performance for the Min-min, it can be seen that the
Max-min assignments change the machine ready time status by a larger
amount than the Min-min assignments do. (The Sufferage also does not
necessarily schedule the tasks that finish later first.) If tasks t_i
and t_k are contending for a particular machine m_j, then the Max-min
assigns m_j to the task (say t_i) that will increase the ready time of
m_j more. This decreases the probability that t_k will still have its
earliest completion time on m_j and be assigned to it. In general, the
percentage of tasks assigned their first choice is likely to be lower
for the Max-min than for other batch mode heuristics. It might be
expected that a larger makespan will result if a larger number of
tasks is assigned to machines that do not have the best execution
times for those tasks.
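
The difference between the two selection rules can be made concrete
with a short sketch. It assumes the standard formulation in which every
unmapped task is first paired with its earliest-completion-time
machine, after which the Min-min commits the task with the smallest
such time and the Max-min commits the task with the largest. The
function name `batch_map` and the `pick_max` flag are illustrative
assumptions.

```python
def batch_map(etc, ready, pick_max=False):
    """Min-min (pick_max=False) or Max-min (pick_max=True), as a sketch.

    etc[t][j]: expected execution time of task t on machine j.
    ready[j]:  initial ready time of machine j.
    Returns (order, makespan); order lists (task, machine) as committed.
    """
    ready = list(ready)
    unmapped = set(range(len(etc)))
    order = []
    chooser = max if pick_max else min
    while unmapped:
        # Earliest completion time, and its machine, for every unmapped task.
        best = {
            t: min((ready[j] + etc[t][j], j) for j in range(len(ready)))
            for t in unmapped
        }
        # Commit one task per iteration, then recompute with new ready times.
        t = chooser(sorted(unmapped), key=lambda u: best[u][0])
        ct, j = best[t]
        ready[j] = ct
        unmapped.remove(t)
        order.append((t, j))
    return order, max(ready)
```

Because the Max-min commits the longest-finishing task first, each of
its assignments moves a machine's ready time further than a Min-min
assignment would, which is the effect discussed above. A single toy
input says nothing about the averaged simulation results reported in
the figures.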

Figure 10 compares the makespan of the batch mode heuristics for
semi-consistent HiHi heterogeneity. The comparison of the same
heuristics for the same parameters is shown in Figure 11 with respect
to average sharing penalty. Results for both average sharing penalty
and makespan for semi-consistent HiHi are similar to those for
inconsistent HiHi.

The impact of aging on batch mode heuristics is shown in Figures 12
and 13. From these figures, three observations are in order. First,
the Max-min heuristic benefits most from the aging scheme. Second, the
makespan and the average sharing penalty given by the Sufferage
heuristic change negligibly when an aging scheme is applied. Third,
even though aging schemes are meant to reduce starvation of tasks (as
gauged by average sharing penalty), they also reduce the makespan.

The fact that the Max-min benefits most from the aging scheme can be
explained using the reasoning given in the discussion on starvation in
Subsection 4.3. The larger the number (say N_new) of newly arriving
tasks between the mapping events τ_l and τ_{l+1}, the larger the
probability that some of the tasks mapped at mapping event τ_l, or
earlier, will be starved (due to more competing tasks). The Max-min
heuristic schedules tasks that finish later first. As mapping events
are not scheduled if machines are busy, two successive mapping events
in the Max-min are likely to be separated by a larger time duration
than those in the Sufferage or the Min-min. The value of N_new is
therefore likely to be larger in the Max-min schedules, and starvation
is more likely to occur. Consequently, aging schemes would make a
greater difference to the Max-min schedules: the tasks that finish
sooner are much more likely to be scheduled before the tasks that
finish later in the Max-min with aging than in the Max-min without
aging. In contrast to the Max-min (or the Min-min) operation, the
Sufferage heuristic optimizes a machine assignment only over the tasks
that are contending for that particular machine. This reduces the
probability of competition between the “older” tasks and the new
arrivals, which in turn reduces the need for an aging scheme, or the
improvement in the schedule if aging is implemented.

Figures 14, 15, 16, and 17 show the results of repeating the above
experiments with a batch count mapping strategy for a batch size of
40. This particular batch size was found to give an optimum value of
the makespan. Figure 14 compares the regular time interval strategy
and the fixed count strategy on the basis of the makespans given by
different heuristics for inconsistent HiHi heterogeneity. In Figure
15, the average sharing penalties of the same heuristics for the same
parameters are compared. It can be seen that the fixed count approach
gives essentially the same results for the Min-min and the Sufferage
heuristics. The Max-min heuristic, however, benefits considerably from
the fixed count approach: its makespan drops to about 60% of the
makespan given by the regular time interval strategy for |K| = 1000,
and to about 50% for |K| = 2000. A possible explanation lies in a
conceptual element of similarity between the fixed count approach and
the aging scheme. A “good” value of κ, the batch size in the fixed
count strategy, is neither so small that it allows only a limited
optimization of machine assignment nor so large that it subjects the
tasks carried over from the previous mapping events to a possibly
defeating competition with the new or recent arrivals. Figures 16 and
17 show the makespan and the average sharing penalty given for the
semi-consistent case. These results show that, for the Sufferage and
the Min-min, the regular time interval approach gives slightly better
results than the fixed count approach. For the Max-min, however, the
fixed count approach gives better performance.
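
As a rough illustration of how the two strategies group arrivals, the
sketch below splits a list of arrival times into batches. It is a
deliberate simplification: it ignores tasks carried over from earlier
mapping events, the rule that mapping events are skipped while
machines are busy, and any additional triggering conditions. The
function names and default values (10 seconds, batch size 40) mirror
the settings quoted above but are otherwise illustrative assumptions.

```python
import math


def regular_interval_batches(arrival_times, interval=10.0):
    """Group arrivals by the first periodic mapping event at or after them."""
    batches = {}
    for t in sorted(arrival_times):
        event_index = math.ceil(t / interval)
        batches.setdefault(event_index, []).append(t)
    return [batches[e] for e in sorted(batches)]


def fixed_count_batches(arrival_times, batch_size=40):
    """Trigger a mapping event each time batch_size tasks have arrived."""
    ordered = sorted(arrival_times)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

Under the regular interval strategy the batch size follows the arrival
rate, whereas under the fixed count strategy the time between mapping
events does.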

6.4.  Comparing  on-line  and  batch  heuristics 

In Figure 18, two on-line mode heuristics, the MCT and the KPB, are
compared with two batch mode heuristics, the Min-min and the
Sufferage. The comparison is performed with the Poisson arrival rate
set to λ_h. It can be noted that for the higher arrival rate and
larger |K|, batch heuristics are superior to on-line heuristics. This
is because the number of tasks waiting to begin execution is likely to
be larger in the above circumstances than in any other, which in turn
means that rescheduling is likely to improve many more mappings in
such a system. The on-line heuristics consider only one task when they
try to optimize machine assignment, and do not reschedule. Recall that
the mapping heuristics use a combination of expected and actual task
execution times to compute machine ready times. The on-line heuristics
are likely to approach the performance of the batch ones at low task
arrival rates, because then both classes of heuristics have comparable
information about the actual execution times of the tasks. For
example, at a certain low arrival rate, the 100-th arriving task might
find that 70 previously arrived tasks have completed. At a higher
arrival rate, only 20 tasks might have completed when the 100-th task
arrived. The above observation is borne out in Figure 19, which shows
that the relative performance difference between on-line and batch
heuristics decreases with a decrease in arrival rate. Given the
observation that the KPB and the Sufferage perform almost identically
at this low arrival rate, it might be better to use the KPB heuristic
because of its smaller computation time. Moreover, Figures 18 and 19
show that the makespan values for all heuristics are larger for the
lower arrival rate. This is attributable to the fact that at lower
arrival rates, a larger fraction of a task’s completion time is
determined by its beginning time.

Figure 10. Makespan of the batch heuristics for the regular time
interval strategy and semi-consistent HiHi heterogeneity.

Figure 11. Average sharing penalty of the batch heuristics for the
regular time interval strategy and semi-consistent HiHi heterogeneity.

Figure 12. Makespan for the batch heuristics for the regular time
interval strategy with and without aging for inconsistent HiHi
heterogeneity.

Figure 13. Average sharing penalty of the batch heuristics for the
regular time interval strategy with and without aging for inconsistent
HiHi heterogeneity.

Figure 14. Comparison of the makespans given by the fixed count
mapping strategy and the regular time interval strategy for
inconsistent HiHi heterogeneity.

Figure 15. Comparison of the average sharing penalty given by the
fixed count mapping strategy and the regular time interval strategy
for inconsistent HiHi heterogeneity.

Figure 16. Comparison of the makespan given by the fixed count mapping
strategy and the regular time interval strategy for semi-consistent
HiHi heterogeneity.

Figure 17. Comparison of the average sharing penalty given by the
fixed count mapping strategy and the regular time interval strategy
for semi-consistent HiHi heterogeneity.

Figure 18. Comparison of the makespan given by batch heuristics
(regular time interval strategy) and on-line heuristics for
inconsistent HiHi heterogeneity and an arrival rate of λ_h.

Figure 19. Comparison of the makespan given by batch heuristics
(regular time interval strategy) and on-line heuristics for
inconsistent HiHi heterogeneity and an arrival rate of λ_l.

7. Conclusions

New and previously proposed dynamic matching and scheduling heuristics
for mapping independent tasks onto HC systems were compared under a
variety of simulated computational environments. Five on-line mode
heuristics and three batch mode heuristics were studied.

In the on-line mode, for both the semi-consistent and the inconsistent
types of HiHi heterogeneity, the KPB heuristic outperformed the other
heuristics on all performance metrics (however, the KPB was only
slightly better than the MCT). The average sharing penalty gains were
smaller than the makespan ones. The KPB can provide good system
oriented performance (e.g., minimum makespan) and at the same time
provide good application oriented performance (e.g., low average
sharing penalty). The relative performance of the OLB and the MET with
respect to the makespan reversed when the heterogeneity was changed
from the semi-consistent to the inconsistent. The OLB did better than
the MET for the semi-consistent case.

In the batch mode, for the semi-consistent and the inconsistent types
of HiHi heterogeneity, the Min-min heuristic outperformed the
Sufferage and Max-min heuristics in the average sharing penalty.
However, the Sufferage performed the best with respect to makespan for
both the semi-consistent and the inconsistent types of HiHi
heterogeneity (though the Sufferage was only slightly better than the
Min-min).

The batch heuristics are likely to give a smaller makespan than the
on-line ones for large |K| and high task arrival rate. For smaller
values of |K| and lower task arrival rates, the improvement in
makespan offered by batch heuristics is likely to be nominal.

This study quantifies how the relative performance of these dynamic
mapping heuristics depends on (a) the consistency property of the ETC
matrix, (b) the requirement to optimize system oriented or application
oriented performance metrics (e.g., optimizing makespan versus
optimizing average sharing penalty), and (c) the arrival rate of the
tasks. Thus, the choice of the heuristic which is best to use will be
a function of such factors. Therefore, it is important to include a
set of heuristics in a resource management system for HC environments,
and then use the heuristic that is most appropriate for a given
situation (as will be done in the Scheduling Advisor for MSHN).

Abstract

We present a new software technology for on-line performance analysis
and visualization of complex parallel and distributed systems. Often
heterogeneous, these systems need capabilities for flexible
integration and configuration of performance analysis and
visualization. Our technology is based on an object-oriented framework
for rapid prototyping and development of distributable visual objects.
The visual objects consist of two levels, a platform/device-specific
low level, and an analysis- and visualization-specific high level. We
have developed a very high-level markup language, called VOML, and a
compiler for component-based development of high-level visual objects.
The VOML is based on a software architecture for on-line event
processing and performance visualization called EPIRA. The technology
lends itself to constructing high-level visual objects from globally
distributed component definitions. We present details of the
technology and tools used, and show how an example visual object can
be rapidly prototyped from several reusable components.


1  Introduction 

Performance analysis and visualization (PAV) tools are crucial
components of an effective development cycle, as well as deployment,
of parallel and distributed applications. On-line PAV is even becoming
necessary for the latter. Since the amount of performance data to be
analyzed and visualized increases with the size of a target
parallel/distributed application, on-line PAV itself should be
distributed. Heterogeneous systems, in addition, need PAV tools that
provide flexible integration and configuration support for
heterogeneous performance data. Extant generic and library-specific
PAV tools for parallel/distributed systems can cover only low-level
performance aspects, provided that the target systems fit into their
generic schemes and/or use specific libraries, such as PVM [8] and MPI
[7]. A wider range of performance aspects, at multiple levels, global
and local, is needed to capture and visually explain the behavior of a
heterogeneous system.

*This work was supported in part by DARPA contract No. DABT
63-95-C-0072, NSF grant No. CDA-9529488, and NSF grant No. ASC-9624149.

We have developed a framework for on-line PAV, called PGRT visual
objects¹, to address these issues. The framework is object-oriented
and easily distributable via middleware software such as CORBA [16]
and DCOM [3]. Within it, a visual-object developer can integrate low-
and high-level, application-specific PAV. Furthermore, it is based on
two visual-object levels for portability and code reuse: a
device-dependent low level, and a device-independent high level. Our
goal was also to be able to integrate various sources of off- and
on-line performance data. To achieve this flexibility, the visual
objects consume performance data in the form of event records from an
environment. To formalize the design of high-level visual objects,
i.e., enforce a structured approach that is less error-prone, we have
defined certain rules and a very high-level, component-based
specification language, called Visual Object Markup Language (VOML).
The language uses Standard Generalized Markup Language (SGML) markup
for structuring visual objects, and Scheme scripts for defining PAV
semantics.

The use of SGML enables development of a PAV information
infrastructure for platform- and tool-independent development of
visual objects. It may also facilitate automatic monitoring, analysis,
and visualization of globally distributed applications via
network-enabled SGML entity managers. The use of Scheme for visual
object semantics enables both rapid prototyping of visual objects and
customizing VOs for a wide range of platforms via, for example,
Scheme-to-C and Scheme-to-Java VM bytecode compilers. That is, a
single VOML specification may be used to generate automatically an X
library-based visual object and one that runs within a WWW browser.

In Section 2, we describe the visual-object framework in detail, and
show an example of successful use for PAV of a distributed multimedia
real-time application. A PAV architecture for high-level visual
objects, the markup language based on it, and the development
environment are presented in Section 3. An example of a VOML
specification is given in Section 4. We compare our PAV approach to
other work in the area in Section 5, and conclude in Section 6.

2  Visual  Object  Architecture 

The Visual Object (VO) architecture identifies two main software
layers apparent in the majority of extant PAV tools, and represents
them as two classes: the high-level VO (HLVO) class and the low-level
VO (LLVO) class. In general, the responsibility of an HLVO class is to
implement application-specific semantics, while an LLVO class is
platform-dependent but provides a platform-independent interface to
the HLVO class. When implementing a VO class, an HLVO class
implementation is derived from an LLVO class implementation, as shown²
in Figure 1. In the following subsections, we describe the main
characteristics of the LLVO and HLVO classes, and show an application
to a heterogeneous system.

2.1 Low-level visual object

The responsibilities of an LLVO class described below illustrate the
basic building block of our PAV technology. They have evolved through
repeated, substantial experimentation with an X library-based
two-dimensional LLVO class that
we  have  implemented  in 

Multiple views. An LLVO maintains a number of display areas, referred
to as views. In our implementation of the LLVO class, each display
area is supported by a contained object that maintains the state of
the corresponding X window.

Graphical primitives. The LLVO class provides methods for rendering
simple graphical objects, text and figures in the views. The
coordinate system used for the graphical objects’ representative
coordinates (as arguments to the methods) is a world coordinate system
specified by the user at the moment of (re)initializing a view.

²Vertical bars in a high-level method denote the presence of multiple
peer components.


Figure  1.  The  design  of  a  visual  object 


Display area. A view consists of an internal area, referred to as the
scrollable area, surrounded by margins. As a visualization progresses,
the mapping from the world coordinate system to the view coordinate
system may change, at which point only the contents of the scrollable
area may be translated or rescaled (zoomed) in response.

Control methods. Methods such as scroll, resize, rescale and snapshot
provide explicit control over each view. Combined with the graphical
primitives, they allow an HLVO to control explicitly, among other
things, what is drawn and what is visible at a point in time.

Quantitative adaptation. A relation between a view and calls to
graphical primitives that draw in the view may be established that
causes the view to adapt dynamically by translating or rescaling
(zooming) the contents of the scrollable area, thus implicitly
controlling what should be visible over an interval of time.

Qualitative adaptation. The LLVO class may be portable to multiple
graphical platforms that differ to some extent (e.g., different X
servers may use different color maps). At run time, it may adapt to
the platform capabilities, as well as provide drawing optimization.
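
The responsibilities listed above suggest an interface of roughly the
following shape. This is only an illustrative sketch: the class and
method names (`LowLevelVisualObject`, `init_view`, `draw_line`,
`RecordingLLVO`) are hypothetical and are not the authors' API, and the
stand-in subclass records calls instead of drawing in an X window.

```python
from abc import ABC, abstractmethod


class LowLevelVisualObject(ABC):
    """Hypothetical platform-independent interface seen by an HLVO."""

    @abstractmethod
    def init_view(self, view, world_min, world_max):
        """(Re)initialize a view and set its world coordinate system."""

    @abstractmethod
    def draw_line(self, view, start, end, adapt=False):
        """Graphical primitive; start and end are world coordinates."""

    @abstractmethod
    def scroll(self, view, dx, dy):
        """Control method: translate the contents of the scrollable area."""

    @abstractmethod
    def rescale(self, view, factor):
        """Control method: zoom the contents of the scrollable area."""


class RecordingLLVO(LowLevelVisualObject):
    """Stand-in platform layer that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def init_view(self, view, world_min, world_max):
        self.calls.append(("init_view", view, world_min, world_max))

    def draw_line(self, view, start, end, adapt=False):
        self.calls.append(("draw_line", view, start, end, adapt))

    def scroll(self, view, dx, dy):
        self.calls.append(("scroll", view, dx, dy))

    def rescale(self, view, factor):
        self.calls.append(("rescale", view, factor))
```

Keeping the interface abstract mirrors the split described above: code
written against it does not change when the platform-dependent subclass
is replaced.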

An LLVO class implementation may perform bookkeeping about graphical
objects being drawn and/or be based on vector graphics, in order to
facilitate quantitative and/or qualitative adaptation. However, this
is not mandatory and an HLVO class implementation can only assume that
the underlying LLVO class is memoryless and raster-based.

The quantitative adaptation of a view in our implementation is
initialized by specifying directions (from {x+, x−, y+, y−}) and types
of adaptation (rescaling or scrolling) for these directions. On the
other hand, one of the parameters of every graphical primitive is the
adaptation flag that determines whether the view should adapt before
the graphical object is drawn, in order for the graphical object to be
visible. In the case of rescaling, the view may also adapt in the
opposite way. For example, if the view had to rescale “down” (zoom
out) in response to a peak in a temporal line plot³, it will rescale
“up” (zoom in) once the peak has disappeared from the scrollable area.
Another parameter for the initialization is the adaptation quality. We
use it to specify the maximum size of data structures (in our
implementation, interconnected red-black trees [6]) used to remember
extreme points of graphical objects that have been drawn with the
adaptation flag set.
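
The zoom-out and zoom-in behavior described above can be sketched for a
single direction as follows. The sketch substitutes a bounded deque for
the interconnected red-black trees mentioned in the text, so it
illustrates the behavior only, not the authors' data structure. The
class name `AdaptiveRange` and its parameters are hypothetical.

```python
from collections import deque


class AdaptiveRange:
    """Upper world-coordinate bound of a view that adapts to drawn points."""

    def __init__(self, base_max, quality):
        self.base_max = base_max
        # "Adaptation quality": at most `quality` extreme points are kept;
        # when the deque is full, the oldest remembered point is dropped.
        self.points = deque(maxlen=quality)

    def draw(self, time, value, adapt=True):
        # Only objects drawn with the adaptation flag set are remembered.
        if adapt:
            self.points.append((time, value))

    def scroll(self, left_edge):
        # Forget points that have left the scrollable area on the left.
        while self.points and self.points[0][0] < left_edge:
            self.points.popleft()

    @property
    def world_max(self):
        # Zoom out to show the largest remembered value; otherwise fall
        # back to the base range (zoom in again).
        return max([self.base_max] + [v for _, v in self.points])
```

A smaller `quality` bounds memory at the cost of possibly forgetting a
peak that is still visible, which is the trade-off the adaptation
quality parameter appears to control.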

2.2  High-level  visual  object 

As for the LLVO class in general, we give implementors freedom to
define a precise framework for developing HLVOs. In this section, we
describe our HLVO implementation base, and in Section 3 we present an
HLVO development framework. The main components of an HLVO class are
the four methods shown in Figure 1.

Event  processing.  The  performance  data  passed  to  an 
HLVO  via  calls  to  the  processing  method  are  termed 
events  (or  data  events).  Based  on  the  events,  this 
method  (1)  updates  performance  information  referred 
to  as  info  structures,  and  (2)  controls  the  rendering  of 
this  information  by  updating  data  structures  referred  to 
as  control  structures. 

Information rendering. The rendering method may be called, to map a
portion of the info structures’ contents to the LLVO views, either
immediately after processing an event (asynchronous rendering mode) or
by a thread that may synchronize the rendering of multiple HLVOs
(synchronous rendering mode). This method communicates with the
processing method by both reading and writing the control structures.

³The contents of the scrollable area are scrolled to the left as time
progresses.

Callback processing. An HLVO may also respond to changes in its
run-time environment, as well as to the user’s commands. This method
may, for example, preprocess callback events coming from the LLVO, a
GUI, etc., and then forward them to the processing method as if they
were data events.

(Re)initialization. In on-line performance visualization, it is
desirable to be able to reinitialize partially or reconfigure an HLVO
without interrupting the target application and/or instrumentation
system that supplies performance data.
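
The four methods described above can be sketched as a small class. The
example is hypothetical throughout: the class `FrameCounterHLVO`, the
event format `(kind, payload)`, and the `draw_text` callable standing in
for an LLVO primitive are assumptions made for illustration, not the
authors' implementation.

```python
class FrameCounterHLVO:
    """Toy HLVO that counts frame events per receiver."""

    def __init__(self, draw_text):
        self.draw_text = draw_text  # stand-in for an LLVO text primitive
        self.reinitialize()

    def reinitialize(self, keep_info=False):
        # Partial reinitialization keeps the collected info structures.
        if not keep_info:
            self.info = {}
        self.control = {"dirty": False, "paused": False}

    def process_event(self, event):
        kind, payload = event
        if kind == "frame":
            # (1) Update the info structure.
            self.info[payload] = self.info.get(payload, 0) + 1
            # (2) Update the control structure that steers rendering.
            self.control["dirty"] = True
        elif kind == "pause":
            self.control["paused"] = payload

    def render(self):
        # Reads and writes the control structures, as described above.
        if self.control["dirty"] and not self.control["paused"]:
            for receiver, count in sorted(self.info.items()):
                self.draw_text(f"{receiver}: {count}")
            self.control["dirty"] = False

    def callback(self, name, value):
        # Forward a GUI or LLVO callback as if it were a data event.
        self.process_event((name, value))
```

Calling `render()` directly after `process_event()` corresponds to the
asynchronous rendering mode; calling it from a separate periodic thread
would correspond to the synchronous mode.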

In order to allow for rapid prototyping of HLVOs and further PAV
research, we have developed a framework based on an implementation of
the Scheme language [5] called GUILE [14]. A generic HLVO class
inherits from the X library-based LLVO class. Both classes have some
methods and data wrapped by Scheme procedures within a (run-time) tool
integration environment for instrumentation and performance
visualization, called PGRT-TIE [1]. The GUI of a VO (in addition to
LLVO callbacks) is implemented separately, using a GUILE interface to
Tk [25]. Because Scheme is a dynamically-typed language, the event
processing method can receive any type of data structure as an event,
which allows for easy integration of different performance data
sources. Most importantly, this high-level algorithmic language is
suitable for easy definition of complex info structures (e.g.,
association lists serving as micro-databases) and compact expression
of updating and querying them in the event processing and information
rendering methods, respectively. We have developed a CORBA interface
for this framework so that a PAV application may consist of VOs
distributed over multiple nodes, while a Scheme-to-C compiler [22] is
also available that may speed up the HLVO code.

2.3 Application of visual objects to a heterogeneous system

As part of the environment, two prototype visual objects have been
applied to the study of a distributed multimedia real-time
application. The target system consisted of a server and a number of
heterogeneous receivers of multimedia data streams. The visual objects
helped determine the processing demands required to play back
different patterns of video frames and to handle different sizes of
video frames, as well as the wasted computation due to receiving video
frames that cannot be replayed due to time constraints (e.g., a new
video frame arrives before an older video frame can be processed).
Furthermore, they helped in understanding the operation of the
application: variations in periodic behavior, and specific points in a
network where frame loss occurs, either due to network congestion or
individual workstations’ loading conditions, were displayed. An
immense amount of state information, condensed into a set of visual
displays, could be used by the feedback control algorithm to make
decisions automatically about the target bandwidth being requested of
the video source.

Figure 2. On-line performance visualization of the real-time
multimedia application

A snapshot of one visual object is given in Figure 2. Its four views
show (1) the throughput of received video data and the estimated
throughput of lost video data, (2) the frame rate, (3) the frequency
distributions of received and lost video frames over one-second
intervals, and (4) a spatial, animated view of all receivers and their
connections, depicting the relative volumes of received, lost, used,
and dropped video packets. The other visual object has 16 views,
divided into four groups: (1) CPU utilization, (2) the periodicity of
video frames received, (3) the number of received ATM cells, and (4)
the number of lost ATM cells. In each group, there are four related
views of the corresponding metric: (1) minimum-average-maximum, (2)
sample deviation, (3) aggregate, and (4) per-receiver histogram.

3  Visual  Object  Markup  Language  (VOML) 

In this section, we describe a framework for semi-automatic design and
prototyping of HLVOs. We first define a generic architecture for event
processing and performance information rendering that is orthogonal⁴
to the VO architecture described in Section 2. Next we present salient
characteristics of a very high-level language based on this
architecture, called Visual Object Markup Language (VOML), and its
compiler that we have designed and implemented. The VOML system allows
a performance visualization developer to concentrate on application-
and visualization-specific semantics and build HLVOs by combining
reusable components.

⁴In this context, where the two architectures coexist, orthogonal
means that either architecture can be extended without affecting the
other.

Figure 3. Event Processing and Information Rendering Architecture
(EPIRA)

3.1 Event Processing and Information Rendering Architecture (EPIRA)

There are many possible patterns for development of complex HLVOs. For
example, one could extend or modify the VO architecture in Figure 1
and build complex HLVOs in a pure object-oriented style, by inheriting
from simpler HLVOs. However, since our goal was to develop a framework
that could be applicable to target languages that do not have strong
support for object orientation (e.g., Scheme and C), we have taken a
component-based approach.

Figure 3 shows the Event Processing and Information Rendering
Architecture (EPIRA). The architecture specifies the tentative parts
of the HLVO architecture, shown in Figure 1, and focuses on the
data-driven computation aspect. The two modules in the figure
correspond to the event processing and information rendering methods.
The events arrive (via method calls) from the two “busses” on the
left: they carry the performance data and the changes in the run-time
environment. Arrows are drawn to denote unidirectional data flows.

The event processing module may contain a number of event processing (EP)
components. Each EP component in turn may contain a number of parts
(separated by horizontal lines in the figure), belonging to one of three
classes: (1) event-based ones, shown in the middle and executed upon
arrival of a specific event, and condition-based ones, which can be
executed (2) before or (3) after the event processing, provided that a
specific condition tests true^5. Since only one event can be received at a
time, among all event-based parts (belonging to different EP components)
only those for the received event are executed. The conditions
corresponding to the condition-based parts are evaluated each time any
event is received.
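The dispatch rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class EPComponent, the function process_event, and the (name, fields) event encoding are hypothetical, and the sketch assumes that all "before" parts of all components run first, then the event-based parts, then all "after" parts, which the text does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, tuple]            # (event name, field values); hypothetical encoding
Condition = Callable[[], bool]
Part = Callable[[], None]


@dataclass
class EPComponent:
    """Hypothetical stand-in for one EP component and its three classes of parts."""
    before: List[Tuple[Condition, Part]] = field(default_factory=list)
    on_event: Dict[str, Callable[[tuple], None]] = field(default_factory=dict)
    after: List[Tuple[Condition, Part]] = field(default_factory=list)


def process_event(components: List[EPComponent], event: Event) -> None:
    name, fields = event
    # (2) Condition-based parts that run before event processing;
    #     their conditions are evaluated on every event.
    for comp in components:
        for cond, part in comp.before:
            if cond():
                part()
    # (1) Event-based parts: only those registered for the received event run.
    for comp in components:
        handler = comp.on_event.get(name)
        if handler is not None:
            handler(fields)
    # (3) Condition-based parts that run after event processing.
    for comp in components:
        for cond, part in comp.after:
            if cond():
                part()
```

A component that registers no handler for the received event still has its conditions evaluated, which matches the statement that conditions are checked each time any event arrives.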

Similarly, the information rendering module contains a number of
information rendering (IR) components. Each IR component in turn contains
two parts, or phases. During the first phase, info and control structures
are analyzed and the contents of the info structures are rendered in
multiple views appropriately. Once all first-phase parts of the IR
components have been executed, the execution of second-phase parts begins,
when the control structures may be updated safely. Since the HLVO assumes
that the LLVO has no special rendering support (e.g., a depth buffer),
some visualizations may depend on the relative execution order of the
first-phase parts.
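The two-phase rule can be illustrated with a minimal Python sketch. The pair-of-callables encoding of an IR component and the name render_all are assumptions made for the example, not part of VOML.

```python
from typing import Callable, List, Tuple

# Each IR component is modelled as (first_phase, second_phase); hypothetical encoding.
IRComponent = Tuple[Callable[[], None], Callable[[], None]]


def render_all(components: List[IRComponent]) -> None:
    # Phase one: every component renders while the control structures are unchanged.
    # The list order matters when the low-level layer has no depth buffer.
    for first_phase, _ in components:
        first_phase()
    # Phase two: only now may control structures be updated.
    for _, second_phase in components:
        second_phase()
```

Because every first-phase part runs before any second-phase part, a component that resets a control variable cannot hide that variable's value from components rendered later in the same pass.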


^5 In VOML, this is referred to as a “condition event.”

3.2 The VOML language

We have chosen SGML [9] as the basis for a PAV information infrastructure
we plan to build around the VO and EPIRA architectures. As a first step,
VOML is an SGML document type definition (DTD) that encompasses the
structure of HLVOs based on EPIRA. Some of its higher-level elements and
example relations among the elements are given in Figures 4a and 4b.

As can be seen from Figure 4b, VOML attributes are used both to specify
certain characteristics of software components described by the elements
and to create relations among them, some of which directly correspond to
the connections shown in Figure 3. The others are not as “hard-wired,”
and are described using Figures 4b and 5. Figure 5 defines an example IR
component that is used in a visual object defined in Figure 4b.

Although SGML is a very suitable tool for writing structured
specifications, it lacks the means for describing the semantics of a
specification. On the other hand, Scheme is a standardized language with
simple syntax and clean semantics that is very suitable for describing the
semantics of EP and IR components. Hence, we have decided to embed Scheme
into VOML markup. Combining markup and a programming language, typically
Java in WWW-related markup languages, is not a new idea. However, the
integration of VOML and Scheme is tighter, as can be seen from the code
example in Figure 5. Unlike script-augmented HTML files that are final
documents to be “executed,” VOML specifications are to be compiled.

Namely, within Scheme code defining the semantics of a component, there
may exist references to “formal parameters:” info structures, control
structures, events (in EP components) or views (in IR components). The
reference notation is Tn[/m], where

•  T is $ for info structures, % for control structures, and ^ for events
   and views;

•  n is the position of the formal parameter in the corresponding
   parameter list (e.g., $0 corresponds to currenttime in the last line of
   Figure 4b, because it is the 0-th argument supplied via the infos
   attribute);

•  optional /m is used for referencing individual fields of an event^6.
   For example, an occurrence of ^0/1 within code of EP component
   onescalarprocess in Figure 4b would reference field value of data event
   onescalar.
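To make the Tn[/m] notation concrete, the following Python sketch rewrites references into the names they denote. It is not the VOML compiler's algorithm; the function resolve, the dotted output form, and the reading of the event/view marker as a caret (^) are assumptions, and only event references (not views) are handled.

```python
import re
from typing import List

# T is one of $, %, ^; n is a position; /m optionally selects an event field.
REF = re.compile(r"([$%^])(\d+)(?:/(\d+))?")


def resolve(code: str, infos: List[str], controls: List[str],
            inputs: List[List[str]]) -> str:
    """Replace Tn[/m] references in code with the names they denote.

    inputs holds, for each listed event, its name followed by its listed
    fields, e.g. [["onescalar", "key", "value"]] for
    inputs="onescalar.key.value".
    """
    def substitute(match: "re.Match[str]") -> str:
        kind, n, m = match.group(1), int(match.group(2)), match.group(3)
        if kind == "$":
            return infos[n]          # n-th entry of the infos attribute
        if kind == "%":
            return controls[n]       # n-th entry of the controls attribute
        event = inputs[n]            # n-th event of the inputs attribute
        if m is None:
            return event[0]
        return f"{event[0]}.{event[int(m) + 1]}"

    return REF.sub(substitute, code)
```

For instance, with infos "currenttime assoclist" and inputs "onescalar.key.value", the text `(set! $0 ^0/1)` would be rewritten so that $0 names currenttime and ^0/1 names the value field of onescalar.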

The info and control structures are translated into special global
variables by the compiler. Effectively, they are “passed” to EP and IR
components by reference when listed in the infos and controls attributes
of the enclosing VOML element. In this way, a reusable component may be
written, tested, and placed into a library. In an SGML system, such
components may be kept as external SGML entities and used in VOML
specifications of different HLVOs by simply referencing them by name.

^6 Currently, VOML only supports PICL [26] compatible events, i.e., lists
with the first two elements being integers that determine the record and
event type.

EP components tend to be application-specific, as they process
application-specific event records. To make them more reusable, the
element preprocess-inputs is provided that allows for specifying “glue
logic” (as Scheme expressions) for data events specific to a new
application. Namely, before an existing EP component is referenced (i.e.,
used), any fields of the data events it processes may be arbitrarily
preprocessed. For example, an EP component that updates info structures
for a simple line-plot visualization (e.g., the Scheme code just under
<ep-component name="onescalarprocess" ...> in Figure 4b, updating info
structures to be rendered by the IR component code in Figure 5) can be
used to visualize the frame rate of a multimedia application. The glue
logic in this case could be a function that divides the number of frames
received in a time interval^7 by the length of the time interval, whose
result would be assigned to the second field (named value in the case of
the default data event onescalar).
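The frame-rate glue logic can be sketched as follows. This is a Python illustration rather than the Scheme expression a VOML specification would contain; the function name frame_rate_glue and the dictionary keys "frames" and "interval" are hypothetical, following the assumption that the frame count arrives in a data event field and the interval length is kept in an info structure.

```python
def frame_rate_glue(event_fields: dict, infos: dict) -> dict:
    """Rewrite an application event so a generic one-scalar EP component can use it."""
    interval = infos["interval"]      # hypothetical info structure: interval length
    if interval <= 0:
        raise ValueError("time interval must be positive")
    return {
        "key": event_fields["key"],
        # The second field, value, receives frames divided by the interval length.
        "value": event_fields["frames"] / interval,
    }
```

With 50 frames received in a 2-second interval, the value field handed to the line-plot EP component is 25.0 frames per second.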

Similarly, different library IR components that are parameterized may be
combined in interesting ways over a number of views. Additionally, they
may be given attributes to determine their higher-level behavior^8. One
such attribute is named refresh, which currently can have any combination
of the values resize, rescale and update. If either of the first two
values is used for an IR component, the component will redraw its contents
if any of the views it draws to gets resized or rescaled. This is useful
for raster-based LLVO class implementations, where resizing or rescaling
an image is lossy. If update is used, the component will undraw what it
drew last time, before proceeding to render the contents of the info
structures again. Certain higher-level behavior, which would by default
ignore any control structures, can also be controlled by an enable
attribute that takes a Scheme expression evaluating to a Boolean value.
When combining IR components, the HLVO developer may define their
execution order.


^7 Assume that the number of frames is contained in a data event field,
and the time interval is kept in an info structure.

^8 Currently, this behavior is supported in our HLVO implementation by
auxiliary Scheme code; it might also be supported by an optimizing LLVO
class.
voml
head
body
visual-object
event-declarations
data-event
info-structures
control-structures
utility-code
view-initializations
view
event-processing
ep-component
preprocess-inputs
info-rendition
ir-component
line

(a)  Higher-level  elements 


<event-declarations>
  <data-event name="onescalar" rtype="entry" etype="3000">
    <data-field name="key">
    <data-field name="value">
<info-structures>
  <variable name="currenttime" type="real">
  <variable name="assoclist" type="list">
  <variable name="palette" type="list">
<control-structures>
  <variable name="beepcount" type="int" init="0">
<utility-code>
  (define (beep)
    (display #\Bel))
</utility-code>
<view-initializations>
  <view name="lineplotview" title="Multi-scalar line-plot" ...>
<event-processing>
  <ep-component name="onescalarprocess" inputs="onescalar.key.value"
                infos="currenttime assoclist">
<info-rendition>
  <ir-component name="lineplotrender" views="lineplotview"
                infos="currenttime assoclist palette" controls="beepcount">

(b)  Relations  among  elements  of  a  VOML  specification 


Figure  4.  A  brief  description  of  VOML 


<description>
This IR component draws a line-plot of multiple scalars over time, in the supplied
view (^0). Only lines with the last-update time equal to the current time are drawn.
Once 10 lines have been drawn, a short sound (beep) is generated.

The info structures consist of the current time ($0, non-negative real number),
a multi-scalar association list ($1, indexed by non-negative integer keys),
and a color palette ($2, a list of strings -- color names).

Each value in the association list is a 4-element vector:
#(old-time old-value new-time new-value).

The key of each value in the multi-scalar association list is used to index the
color. When all colors are exhausted, the line thickness is increased to distinguish
between different scalars. A counter is used as a control structure (%0) for
generating sounds.
</description>

(let ((palette-len (length $2)))
  (alist-for-each
    (lambda (scalar-id scalar)
      (if (= $0 (vector-ref scalar 2))
          (begin
            (set! %0 (+ %0 1))
            (if (= %0 10)
                (begin
                  (beep)
                  (set! %0 0)))
            <line view="^0" from="(vector-ref scalar 0) (vector-ref scalar 1)"
                  to="$0 (vector-ref scalar 3)" thick="(+ (quotient scalar-id palette-len) 1)"
                  color="(nth (modulo scalar-id palette-len) $2)" adapt="yes" clip="margin">)))
    $1))


Figure  5.  Code  of  the  IR  component  used  as  lineplotrender  in  Figure  4b 
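The color and thickness arithmetic in the line element of Figure 5 can be restated as a small Python function. The name line_style is hypothetical; the modulo and quotient expressions are taken from the listing.

```python
from typing import List, Tuple


def line_style(scalar_id: int, palette: List[str]) -> Tuple[str, int]:
    """Return (color, thickness) for a scalar, mirroring the modulo/quotient arithmetic."""
    if not palette:
        raise ValueError("palette must not be empty")
    color = palette[scalar_id % len(palette)]        # key indexes the color, wrapping around
    thickness = scalar_id // len(palette) + 1        # each wrap-around thickens the line
    return color, thickness
```

With a three-color palette, scalars 0 to 2 are drawn with thickness 1, scalars 3 to 5 reuse the same colors with thickness 2, and so on, which is how scalars remain distinguishable once the colors are exhausted.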
Figure 6. VOML compilation and execution process diagram


3.3  The  VOML  compiler 

The VOML compiler is built on top of an SGML transformation library
called STIL [21] and consists of the following components.

SGML parser. The sgmls parser [4] is used as the front-end that parses an
    SGML declaration, VOML DTD, and external entities used in a VOML
    specification of an HLVO.

STIL library. This library is written for the clisp [10] implementation of
    Common Lisp with CLOS. It allows traversing a parse tree created by
    the SGML parser, and defining “hooks” (semantic actions) that are
    called during the traversal.

VOML validating parser. One part of this component consists of the hooks
    called by the STIL library. The other part consists of CLOS objects
    that contain code and other information relevant to EPIRA components
    of the HLVO specification being compiled. The hooks process VOML
    elements (including the contents in Scheme), their relations and
    attributes, and build the CLOS objects.

VOML code generator. This component “tangles” the plain, application- and
    visualization-specific Scheme code from a VOML specification with code
    that it generates for integration with the run-time environment. The
    latter includes a graphical user interface for accessing and modifying
    selected info and control structures, managing views, registering VOs
    with routines that supply data events, etc.

Figure 6 shows the compilation and execution process of a VOML
specification in our GUILE-based environment. Processes are shown as oval
rectangles, while input, intermediate and output files are shown as
rectangles. Solid lines denote the process of file inclusion, while dashed
lines denote references in VOML and Scheme files.

The VOML design allows for extending the compiler to support other
Scheme-based run-time environments. Interesting extensions would be for
Kawa [2], a Scheme compiler written in Java that generates JVM bytecodes,
and Skij [13], a Scheme interpreter that allows rapid prototyping in the
Java environment. With an LLVO class implementation in Java, it would
become relatively easy to develop PAV applications running in a WWW
browser, leveraging platform-independent VOML specifications, and
receiving performance data over the Internet. Alternatively, a simpler
version of VOML could be defined in XML [15] instead of SGML, and/or Java
could be used instead of Scheme to both compile and execute VOML
specifications.

While an SGML parser uses an entity manager to find components of a
document that are referenced as external entities (such as library EP and
IR components) within its virtual storage system, the SGML standard
itself does not specify how to implement one. A WWW-enabled entity manager
would further enlarge the PAV information infrastructure and automate
monitoring and PAV of globally distributed applications. Figure 7 shows an
example of how component definitions could be fetched from a WWW site by
the entity manager, to be included for compilation. In the example, the
vendor of an imaginary software product, whose performance we want to
visualize, keeps the latest implementations of an EP and IR component for
the product, ready to be used in our HLVO specification^9.

<!DOCTYPE VOML PUBLIC "-//MSU-PGRT//DTD VOML 1.0//EN"
[
<!ENTITY SoftwareXYZep
         SYSTEM "http://vendor.com/voml/XYZep.voml">
<!ENTITY SoftwareXYZir
         SYSTEM "http://vendor.com/voml/XYZir.voml">
]
>
<voml>
<event-processing>
<ep-component name="XYZep"
              inputs="mydata.f1.f2" ...>
&SoftwareXYZep;
</ep-component>
<info-rendition>
<ir-component name="XYZir"
              views="myview" ...>
&SoftwareXYZir;
</ir-component>


Figure  7.  Sketch  of  a  VOML  specification  that 
uses  remote  component  definitions 
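A WWW-enabled entity manager of the kind suggested above might look roughly like the following Python sketch. Everything here is hypothetical: the class EntityManager, its methods, and the injected fetch function are illustrative only, since the SGML standard leaves the entity manager unspecified and the paper does not describe an implementation. The fetch function is passed in so that the sketch does not depend on any particular HTTP library.

```python
from typing import Callable, Dict


class EntityManager:
    """Hypothetical entity manager mapping entity names to system identifiers."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch                      # e.g. a function wrapping an HTTP GET
        self._system_ids: Dict[str, str] = {}
        self._cache: Dict[str, str] = {}

    def declare(self, name: str, system_id: str) -> None:
        """Record an <!ENTITY name SYSTEM "system_id"> declaration."""
        self._system_ids[name] = system_id

    def resolve(self, name: str) -> str:
        """Return the replacement text for &name;, fetching it at most once."""
        if name not in self._system_ids:
            raise KeyError(f"undeclared entity: {name}")
        if name not in self._cache:
            self._cache[name] = self._fetch(self._system_ids[name])
        return self._cache[name]
```

Caching the fetched text is a design choice of the sketch: a component referenced several times in one specification is retrieved from the vendor's site only once per compilation.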


4 The VOML Specification of a Simple Visual Object

In this section, we present and comment on the main parts of the VOML
specification of a simple VO with a view similar to the last one of the VO
shown in Figure 2. The VO receives performance data events from a
distributed application, generated whenever a node is (1) added or (2)
removed, and periodically to carry profile data from each node. The event
declaration section is shown in Figure 8. The record and event types of
the first two events are taken from the PICL specification [26], while the
profile event belongs to an extension of PICL. When some fields are
skipped (i.e., ignored), the index attribute is used to specify the
position of the next declared field.

^9 It is assumed that we already have the information about the
components’ interfaces.


<event-declarations>
  <data-event name="addnode"
              rtype="pg-entry" etype="-901">
    <data-field name="ts" type="int">
    <data-field name="node-id" type="int">
  </data-event>
  <data-event name="rmnode"
              rtype="pg-exit" etype="-901">
    <data-field name="ts" type="real">
    <data-field name="node-id" type="int">
  </data-event>
  <data-event name="node-prf"
              rtype="entry" etype="3141">
    <data-field name="ts" type="real">
    <data-field name="node-id" index="3" type="int">
    <data-field name="node-type" index="5" type="int">
    <data-field name="rkbps" type="real">
    <data-field name="tb" type="real">
    <data-field name="used" type="real">
    <data-field name="fps" type="real">
    <data-field name="packets" type="int">
    <data-field name="pack-used" type="int">
  </data-event>
</event-declarations>

Figure  8.  Event  declarations 
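The effect of the index attribute on field positions can be sketched in Python. The function field_positions is hypothetical, and the position of the first declared field (the parameter first) is an assumption, since the text does not say where numbering starts; the sketch only shows that an explicit index skips ahead and that later fields continue from there.

```python
from typing import Dict, List, Optional, Tuple


def field_positions(fields: List[Tuple[str, Optional[int]]],
                    first: int = 0) -> Dict[str, int]:
    """Map each declared field name to its position in the event record.

    fields is a list of (name, index) pairs; index is None when the
    attribute is absent. first is the assumed position of the first field.
    """
    positions: Dict[str, int] = {}
    next_pos = first
    for name, index in fields:
        if index is not None:
            if index < next_pos:
                raise ValueError(f"index {index} for {name} overlaps an earlier field")
            next_pos = index                 # skip the ignored fields
        positions[name] = next_pos
        next_pos += 1                        # undeclared index: continue sequentially
    return positions
```

For a declaration list such as a, b (index 3), c (index 5), d, the sketch places d at position 6, directly after c.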

The info and control structure specifications are shown in Figure 9. The
info variable numofnodes (although redundant) keeps the current number of
communicating nodes; nodes is an association list that keeps the previous
and current profile of each node; nodeno keeps the (non-negative) node id
from the latest profile event. The control variable nodechange indicates
whether the number of nodes has changed (meaning that the view has to be
updated, as will be seen later).


<info-structures>
  <variable name="numofnodes" type="int">
  <variable name="nodes" type="list">
  <variable name="nodeno" type="int" init="-1">
</info-structures>
<control-structures>
  <variable name="nodechange" type="boolean">
</control-structures>

Figure  9.  Info  and  control  structures 

Next is the utility code section, which we omit for brevity. It contains
the function getfontname, which returns a font name from a list of
available fonts, given some hints. In addition, the functions id-get,
id-put and id-rem, which are used to manipulate the nodes association
list, may be defined in this section (in our case, they are defined in and
exported from another module, available in the run-time environment).

The view initialization section is shown in Figure 10. The (only) BU-View
view is 700 by 700 pixels, through which a rectangle in the world
coordinate system from (-10,-10) to (110,120) is visible. In this example,
the view neither scrolls nor zooms. The control variable nodechange is set
to true only to trigger the drawing of the switch in the beginning.


Figure  10.  View  initialization 


<event-processing>
<ep-component name="rmnode" inputs="rmnode.ts.node-id"
              infos="numofnodes nodes nodeno" controls="nodechange">
<description>Remove a node</description>
<input name="^0">
(set! $1 (id-rem $1 ^0/1))
(set! $0 (- $0 1))
(set! $2 -1)
(set! %0 #t)
</input>
</ep-component>
<ep-component name="addnode" inputs="addnode.ts.node-id"
              infos="numofnodes nodes nodeno" controls="nodechange">
<description>Add a node, reset the infos</description>
<input name="^0">
(set! $1 (id-put $1 ^0/1
                 (cons (vector ^0/0 0 0 0 0 0 0 0 0 0)
                       (vector ^0/0 0 0 0 0 0 0 0 0 0))))
(set! $0 (+ $0 1))
(set! $2 -1)
(set! %0 #t)
</input>
</ep-component>
<ep-component name="nodeprofile"
              inputs="node-prf.ts.node-id.node-type.rkbps.tb.used.fps.packets.pack-used"
              infos="numofnodes nodes nodeno">
<description>Update a node's infos</description>
<input name="^0">
(let ((old-info (cdr (id-get $1 ^0/1)))
      (new-info (vector ^0/0 ^0/1 ^0/2 ^0/3 ^0/4
                        ^0/5 ^0/6 ^0/7 ^0/8)))
  (set! $1 (id-put $1 ^0/1
                   (cons old-info new-info))))
(set! $2 ^0/1)
</input>
</ep-component>
</event-processing>

Figure  11.  Event  processing  components 

There is one EP component for each event, although one could implement,
for example, only one for all three events. They are shown in Figure 11,
updating the info and control structures according to the event
declarations. Fields of an event are listed using a notation in which the
event name is followed by some of its fields’ names, delimited by periods.
In the nodeprofile EP component, all the event fields declared above are
used. It is not necessary to use them all, or in the same order as
declared; the ^m/n notation uses the order(s) given in the inputs
attribute. It can be seen that the field node-id is used as the key, and
the value field in the nodes association list is a pair of vectors keeping
the previous and current profile of a node. In this example, only the
current profile will be used, but in a more complex VO both the previous
and current one may be needed.

<ir-component name="nodes-ir" views="BU-View"
              infos="numofnodes" controls="nodechange"
              refresh="update resize" buffer="yes" enable="%0">
<description>Switch, nodes, connections</description>
(let* ((viewinfo <view-info view="^0">)
       (width (list-ref viewinfo 5))
       (height (list-ref viewinfo 6))
       (size (inexact->exact
               (max (/ width 40) (/ height 40))))
       (font (getfontname "fonttable"
                          "courier" size "bold")))
  <text view="^0" coords="50 107" halign="CENTER"
        font="font" fcolor='"black"'
        contents='"Bandwidth Utilization"'>
  <figure view="^0" filename='"bggif/switch.gif"'
          orig-origin="0 0" orig-extents="0 0"
          world-origin="45 45" world-extents="10 10">
  (let* ((nodenum (- $0 1))
         (step (/ 6.28 nodenum)))
    (if (gt nodenum 0)
        (let loop ((num nodenum))
          (let* ((angle (* num step))
                 (sine (sin angle))
                 (cosine (cos angle)))
            <figure view="^0"
                    filename='"bggif/node.gif"'
                    orig-origin="0 0" orig-extents="0 0"
                    world-origin="(+ 45 (* 45 sine))
                                  (+ 45 (* 45 cosine))"
                    world-extents="10 10">
            <line view="^0" from="(+ 50 (* 6 sine))
                                  (+ 50 (* 6 cosine))"
                  to="(+ 50 (* 39 sine))
                      (+ 50 (* 39 cosine))"
                  color='"red"' thick="12">
            (if (gt num 1)
                (loop (- num 1))))))))
<end-with>(set! %0 #f)</end-with>
</ir-component>

Figure 12. Template IR component
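The bookkeeping done by the three EP components of Figure 11 on the nodes structure can be mirrored in Python. This sketch substitutes a dictionary for the Scheme association list, and the function names add_node, remove_node and update_profile are hypothetical stand-ins for the code that calls id-put, id-rem and id-get; the semantics of those three functions are inferred from how the listing uses them.

```python
from typing import Dict, List, Tuple

Profile = List[float]
Nodes = Dict[int, Tuple[Profile, Profile]]    # node id -> (previous, current)


def add_node(nodes: Nodes, node_id: int, ts: float, width: int = 10) -> None:
    """Insert a node whose previous and current profiles hold ts followed by zeros."""
    blank = [ts] + [0] * (width - 1)
    nodes[node_id] = (list(blank), list(blank))


def remove_node(nodes: Nodes, node_id: int) -> None:
    """Drop the node, if present."""
    nodes.pop(node_id, None)


def update_profile(nodes: Nodes, node_id: int, profile: Profile) -> None:
    """Shift the current profile into the previous slot and store the new one."""
    _, current = nodes[node_id]
    nodes[node_id] = (current, list(profile))
```

Keeping both profiles costs little and lets a later, more complex IR component compute deltas without changing the EP components.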

Finally, the information rendering section consists of two IR components.
The nodes-ir IR component, which is shown in Figure 12 and will be
executed first^10, writes text and draws the switch and as many “globes”
around it as there are active nodes, connected with the switch via thick
red lines. The enable attribute specifies that this IR component should be
executed whenever the number of nodes has changed. The refresh attribute
adds that everything the IR component drew last time should be redrawn
when the view BU-View is resized. It also specifies that everything the IR
component drew last time has to be undrawn before something new is drawn
(whenever the IR component is enabled). The buffer attribute is used to
make the IR component draw in memory only until it is done, and then flush
the contents of the memory to the screen. This makes the rendering
smoother and faster when there are many graphical objects to be drawn.
This IR component resets the nodechange control variable in the second
phase, so that other IR components that depend on the value of this
variable may be added safely^11.

^10 In the prototype implementation of the VOML compiler, the execution
order of the IR components is the opposite of the order in which they
appear in a specification.

<ir-component name="bu-ir" views="BU-View"
              infos="numofnodes nodes nodeno"
              refresh="resize">
<description> Bandwidth utilization </description>
(if (gt $2 -1)
    (let* ((angle (* $2 (/ 6.28 (- $0 1))))
           (sine (sin angle))
           (cosine (cos angle))
           (new-info (cdr (id-get $1 $2)))
           (newkbps (vector-ref new-info 3))
           (newused (vector-ref new-info 5))
           (mag (* 25 (/ newused newkbps))))
      <line view="^0"
            from="(+ 50 (* 6 sine))
                  (+ 50 (* 6 cosine))"
            to="(+ 50 (* 39 sine))
                (+ 50 (* 39 cosine))"
            color='"red"' thick="12">
      <line view="^0"
            from="(+ 50 (* 6 sine))
                  (+ 50 (* 6 cosine))"
            to="(+ 50 (* (+ 6 mag) sine))
                (+ 50 (* (+ 6 mag) cosine))"
            color='"blue"' thick="10">
      (set! $2 -1)))
</ir-component>

Figure 13. Active IR component

The other IR component is executed each time an event is received, and it
draws a thick blue line on top of a thick red line (drawn in the same
place as the thick red line drawn by the former IR component, so that it
can effectively be undrawn), showing the relative bandwidth used by a node
(if a profile event was received last that set the nodeno info variable to
a non-negative value). Its code is shown in Figure 13. The refresh
attribute treats the resizing of the view the same as above.
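The geometry of the blue utilization bar in Figure 13 can be checked with a short Python function. The name utilization_bar and its keyword parameters are hypothetical; the constants 50, 6, 25 and 6.28 and the formulas for angle and mag are copied from the listing, while the argument check is an addition of the sketch.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def utilization_bar(node_no: int, num_nodes: int, used: float, capacity: float,
                    center: float = 50.0, inner: float = 6.0,
                    scale: float = 25.0) -> Tuple[Point, Point]:
    """Return the (from, to) end points of the blue bar for one node."""
    if num_nodes < 2 or capacity <= 0:
        raise ValueError("need num_nodes >= 2 and a positive capacity")
    angle = node_no * (6.28 / (num_nodes - 1))       # same step as in Figure 12
    sine, cosine = math.sin(angle), math.cos(angle)
    mag = scale * (used / capacity)                  # bar length grows with utilization
    start = (center + inner * sine, center + inner * cosine)
    end = (center + (inner + mag) * sine, center + (inner + mag) * cosine)
    return start, end
```

The bar starts where the red connection line starts (6 units from the switch center) and, at full utilization, extends 25 units further, which stays inside the red line that ends 39 units from the center.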

A snapshot of the view is given in Figure 14. We have not shown all the
VOML features in this example; for a tutorial, please visit the URL given
in Section 6.

Figure 14. A snapshot of the view

^11 In Scheme, this second phase is implemented using delay and force.

5 Related Work

ParaGraph [11] is a PAV tool for parallel programs, based on the PICL
communication library. PAV environments are progressing with features to
incorporate new analysis and display modules. Visualization environments
are not only becoming extensible, but retargetable to different analysis
scenarios. Pablo took this research one step further by incorporating
support for performance environment prototyping [20]. VIZ continues in
this direction by focusing on the visualization technology required for
application-specific performance visualizations [12]. Avatar [19] uses
Pablo to study two types of high-performance input/output of the Portable
Parallel File System (PPFS): parallel scientific codes and WWW servers.
The Rivet project [17] integrates new visualization tools into the design
and evaluation process of a variety of computer system components,
specifically processor and memory systems, multiprocessor architectures,
compilers, operating systems, and networks. Lucent Technologies’ Visual
Insights [24] offers a set of interactive and linked data visualization
components for the Microsoft ActiveX developer market that help software
developers to create more flexible, animated ways to display trends in
vast stores of information.

In Table 1 we compare PGRT visual objects with related PAV tools and
systems. While some of the latter have gone farther in certain directions,
such as the graphical metaphor, our design decisions were primarily based
on the requirements stated in Section 1. We have also been concerned that
insisting on state-of-the-art, academic software technologies, such as
lazy functional languages, could limit the practicality of our approach.
Instead, by using technologies that are gaining acceptance, such as SGML
(e.g., in the Chemical Markup Language [18]) and the use of structure,
components and scripting (e.g., in VRML [23]), we hope to contribute to
the PAV community.
Tool/System      On/off-line  Graphical      Underlying graphical  View classes      Reusable
                 operation    metaphor       technology
---------------  -----------  -------------  --------------------  ----------------  --------
ParaGraph        off-line     data-flow      X library             generic           no
Pablo widgets    off-line     data-flow      X library             generic           yes
Avatar           on-line      data-flow      VRML                  scattercube only  no
VIZ              on-line      data-reactive  Open Inventor         domain-specific   yes
Rivet            off-line     data-flow      OpenGL                domain-specific   yes
Visual Insights  off-line     n/a            n/a                   generic           yes
PGRT VO          on-line      data-flow      low-level VO          domain-specific   yes
                                             implementations

Table  1.  Performance  visualization  tools  and  systems 


6  Conclusions 

We have presented a novel PAV technology intended to satisfy the growing
needs of researchers and users of parallel and distributed systems.
Salient characteristics of the technology include support for rapid
prototyping and automated design of PAV tools, object orientation,
distributability, portability, code reuse and flexibility. Tutorials with
examples and reference manuals for PGRT-TIE and VOML can be found in the
Documents section at http://www.egr.msu.edu/Pgrt/.

In the future, we plan to extend and improve the technology, and make it
available to the PAV community. We will develop visual objects specific to
parallel/distributed real-time applications, but also try to help PAV
developers using our technology in other areas by developing
domain-specific libraries of EP and IR components.
Abstract

Harness is a Java-centric, experimental metacomputing framework based
upon the principle of dynamic enrollment and reconfiguration of
heterogeneous computational resources into distributed virtual machines.
The dynamic behavior of the system is not limited to the number and types
of computers and networks that comprise the virtual machine, but also
extends to the capabilities of the virtual machine itself. These
fundamental characteristics address the inflexibility of current
metacomputing frameworks as well as their inability to easily incorporate
new, heterogeneous technologies and architectures and avoid rapid
obsolescence. The adaptable behavior of Harness derives both from a
user-controlled, distributed "plug-in" mechanism and from an event-driven,
dynamic management of the distributed virtual machine status, which are
central features of the system.


1  Introduction 

Harness  is  an  experimental,  Java-centric 
metacomputing  framework  based  upon  the  principle  of 
dynamic  enrollment  and  reconfiguration  of  heterogeneous 
computational  resources  into  networked  virtual  machines. 
The  reconfiguration  capabilities  of  Harness  are  not  limited 
to  the  set  of  computers  and  networks  enrolled  in  the  virtual 
machine; they also include the
services offered by the virtual machine itself. Furthermore,
reconfiguration is not constrained to take place during the
virtual machine setup phase but can also be applied at run
time. This level of reconfigurability is made possible by a
user-controlled, distributed "plug-in" mechanism together with
a dynamic, fault-tolerant management of the distributed
virtual machine status.

The motivation for a plug-in-based approach to
reconfigurable virtual machines derives from two
observations. First, new advances in information
technology  require  distributed  and  cluster  computing  to 
adapt  to  heterogeneous  processors,  interconnection 
network  types  and  protocols  in  order  to  be  able  to  take 
advantage  of  them.  For  example,  the  availability  of 
Myrinet  [1]  interfaces  and  Illinois  Fast  Messages  [2]  has 
recently  led  to  new  models  for  closely  coupled  Network 
Of  Workstations  computing  systems.  Similarly,  multicast 
protocols  and  better  algorithms  for  video  and  audio  codecs 
have led to a number of projects that focus on
telepresence over distributed systems. In these instances,
metacomputing frameworks that cannot cope with
computational resources that are heterogeneous both in
architecture and in connectivity are subject to rapid
obsolescence: the underlying middleware needs to be
either changed or reconstructed, thereby increasing
the effort involved and hampering interoperability.
In contrast, a virtual machine model that intrinsically
incorporates reconfiguration capabilities in terms of the
services provided by any single computational resource
allows the definition of a consistent service baseline in
such an evolving environment and addresses these
issues effectively.

Second,  applications  characterized  by  long  run  times 
require  a  virtual  machine  environment  that  can 
dynamically  adapt  the  available  resources  to  meet  the 
application's  needs,  rather  than  forcing  the  application  to 
fit into a fixed environment. Examples are
long-lived simulations such as climate simulations. These
applications evolve through several phases (data input,
problem setup, calculation, and analysis or visualization of
results), each with its own profile in terms of resource
requests and requirements. To maximize the
availability of resources to the application, a metacomputer
needs to be able to dynamically enroll heterogeneous
computational resources into the resource pool. At the
same time, if the metacomputer is not able to reconfigure
the resources to suit the needs of the application, the
availability is purely nominal and the utilization of the
available resources is poor. In contrast, by allowing
applications to dynamically enroll and reconfigure
heterogeneous computational resources at run time, the
overall utilization of the computing infrastructure can be
enhanced and the need for a consistent service baseline
fulfilled.

The  overall  goal  of  the  Harness  project  encompasses 
several  different  issues  such  as  dynamic  reconfiguration 
and  management  of  distributed  virtual  machines,  fault- 
tolerance,  security,  authentication  and  access  control. 
However, in this paper we focus on investigating and
developing a mechanism for fault-tolerant, dynamic
reconfiguration and management of distributed virtual
machines. Within the framework of a heterogeneous
computing environment, this mechanism allows users and
applications to dynamically customize, adapt, and extend
the distributed computing environment's features to match
their needs without compromising the consistency of the
programming environment itself.

The  paper  is  structured  as  follows:  in  section  2  we 
describe  the  abstract  architecture  of  Harness  DVMs  and 
our  design  choices;  in  section  3  we  detail  the  architecture 
of  the  system;  in  section  4  we  describe  some  example 
plug-ins and applications; in section 5 we relate our work
to other similar projects; finally, in section 6 we provide
some concluding remarks.

2 Fundamental abstractions, terminology and
design choices

The  fundamental  abstraction  in  the  Harness 
metacomputing  framework  is  the  Distributed  Virtual 
Machine  (DVM)  (see  figure  1,  level  1).  Any  DVM  is 
associated  with  a  symbolic  name  that  is  unique  in  the 
Harness  name  space,  but  has  no  physical  entities 
connected to it. Heterogeneous Computational
Resources may enroll into a DVM (see figure 1, level 2)
at any time; however, at this level the DVM is not yet
ready to accept requests from users. To become ready to
interact with users and applications, the heterogeneous
computational resources enrolled in a DVM need to load
plug-ins (see figure 1, level 3). A plug-in is a software
component  implementing  a  specific  service.  By  loading 
plug-ins  a  DVM  can  build  a  consistent  service  baseline 
(see  figure  1,  level  4).  Users  may  reconfigure  the  DVM  at 
any  time  (see  figure  1,  level  4)  both  in  terms  of 
computational  resources  enrolled  by  having  them  join  or 
leave  the  DVM  and  in  terms  of  services  available  by 
loading  and  unloading  plug-ins. 

The  main  goal  of  the  Harness  metacomputing 
framework  is  to  achieve  the  capability  to  enroll 
heterogeneous  computational  resources  into  a  DVM  and 
make  them  capable  of  delivering  a  consistent  service 
baseline to users. This goal requires the programs that make
up the framework to be as portable as possible across as
wide a selection of systems as possible. The availability of
services to heterogeneous computational resources derives
from two different properties of the framework: the
portability of plug-ins and the presence of multiple
searchable plug-in repositories. Harness implements these
properties mainly by leveraging two features of Java
technology. These features are the capability to layer a
homogeneous  architecture  such  as  the  Java  Virtual 
Machine  (JVM)  [3]  over  a  large  set  of  heterogeneous 
computational  resources,  and  the  capability  to  customize 
the  mechanism  adopted  to  load  and  link  new  objects  and 
libraries.  However,  the  adoption  of  the  Java  language  as 
the  development  platform  for  the  Harness  metacomputing 
framework  has  given  us  several  other  advantages: 

• it allowed us to develop the framework as a collection
of cooperating objects with consistent boundaries
(Java classes) and to guarantee users an OO
development environment;

• it allowed us to define a clear and consistent boundary
for plug-ins: each plug-in is required to appear
to the system as a Java class;

• it allowed us to implement all the entities in the
framework adopting a robust multithreaded
architecture;

• it allows users to develop additional services both in a
passive, library-like flavor and in an active,
thread-enabled flavor;

• it provided us with an object-oriented mechanism to
request services from remote computational resources
(Java Remote Method Invocation [4]);

• it provided us with a generic methodology to transfer data
over the network in a consistent format (Java Object
Serialization [5]);

• it allowed us to provide users with the definition of the
interfaces to be implemented by plug-ins
implementing the basic services;

•  it  allowed  us  to  tune  the  trade-off  between  portability 
and  efficiency  for  the  different  components  of  the 
framework. 

This last capability is extremely important. Although
portability at large is needed in all the
components of the framework, it is possible to distinguish
three categories of components that
require different levels of portability. The first category is
represented by the components implementing the
capability to manage the DVM status and to load and unload
services. We call these components kernel level services.
These services require the highest achievable degree of
portability, because they are necessary to enroll
a computational resource into a DVM. The second
category is represented by very commonly used services
(e.g. a general, network-independent message passing
service or a generic event notification mechanism). We
call these services basic services. Basic services should be
generally available, but it is conceivable for some
computational resources based on specialized architectures
to lack them. The last category is represented by highly
architecture-specific services. These include all
those services that are inherently dependent on the specific
characteristics of a computational resource (e.g. a low-level
image processing service exploiting a SIMD coprocessor,
a message passing service exploiting a specific
network interface, or any service that needs
architecture-dependent optimization). We call these services
specialized services. For this last category portability is a
goal to strive for, but it is acceptable that they will be
available only on small subsets of the available
computational resources. These different degrees of
required portability and efficiency over heterogeneous
computational resources can leverage the
capability to link together Java byte code and
system-dependent native code enabled by the Java Native
Interface (JNI) [6]. The JNI makes it possible to develop the
parts of the framework that are most critical to efficient
application execution in ANSI C and to introduce into them
the desired level of architecture-dependent optimization at
the cost of increased development effort.

The use of native code requires a different
implementation of a service for each type of
heterogeneous computational resource that needs to deliver
that service. This multiplies the development effort
for each plug-in that includes native code.
However, if a version of the plug-in for a specific
architecture is available, the Harness metacomputing
framework is able to fetch and load it in a user-transparent
fashion; users are thus shielded from the need to
keep track of the set of architectures their application is
currently running on. To achieve this result Harness
leverages the capability of the JVM to let users redefine
the mechanism used to retrieve and load both Java class
bytecode and native shared libraries. Each DVM in
the framework is able to search a set of plug-in
repositories for the desired library. This set of repositories
is dynamically reconfigurable at run time: users can add
new repositories at any time.
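As a minimal sketch of the repository idea above, and not the actual Harness loader, a class loader with a search path that grows at run time can be built on the standard java.net.URLClassLoader. The class name PluginRepositoryLoader and its methods are hypothetical. Native shared libraries are not covered here; they could presumably be handled in a similar way by overriding ClassLoader.findLibrary.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical sketch: a class loader whose list of plug-in repositories can grow at run time. */
final class PluginRepositoryLoader extends URLClassLoader {

    PluginRepositoryLoader(ClassLoader parent) {
        super(new URL[0], parent);   // start with no repositories
    }

    /** Adds a directory or JAR file to the repositories searched for plug-in classes. */
    void addRepository(Path location) {
        try {
            addURL(location.toUri().toURL());   // addURL is a protected method of URLClassLoader
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("not a usable repository: " + location, e);
        }
    }

    /** Number of repositories currently on the search path. */
    int repositoryCount() {
        return getURLs().length;
    }

    /** Looks up a plug-in class by name; empty if neither the parent nor any repository has it. */
    Optional<Class<?>> findPlugin(String className) {
        try {
            return Optional.<Class<?>>of(loadClass(className));
        } catch (ClassNotFoundException e) {
            return Optional.empty();
        }
    }
}
```

Because the search path lives in the loader instance, adding a repository affects only later lookups, which matches the "add new repositories at any time" behavior described in the text.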

3  Harness  requirements  and  architecture 

3.1  Prototype  implementation  constraints  and 
long  term  choices 

The  main  requirement  for  a  computational  resource  to 
be  enrolled  in  a  DVM  is  the  capability  to  run  Java 
programs,  i.e.  the  presence  of  an  implementation  of  a  JVM 
on the given architecture. This is not a constraint derived
from the prototype implementation but rather a general design
decision, and we do not foresee, at this moment, any reason
to change it. However, this is not a very restrictive
constraint: all the major UNIX
platforms as well as Microsoft platforms provide a
functional JVM.

A more restrictive constraint that the current
implementation of the framework imposes on
computational resources is the requirement to support IP
multicast communication. Both the discovery-and-join
protocol and the recovery-from-crash protocol use
this type of communication. The first protocol implements
the capability to search the Harness name space for an
active DVM and enroll into it, while the second protocol is
used to keep a consistent status in the event of host or
network crashes. To relax this restriction we plan to
develop, for future releases of the software, a non-multicast
version of these two protocols that relies on centralized
services; the current prototype, however, requires
support for IP multicast.

A third constraint imposed by the framework is the
requirement for plug-ins to appear to the system as Java
classes. This requirement places very few
restrictions on the way users may develop plug-ins:
a plug-in is not required to be a monolithic entity and may
depend on any number of other classes. Moreover, if
efficiency requirements are too strict to be fulfilled by
pure Java code, it is possible to use the JNI either to
wrap native code into Java classes manually [7] or to
use one of the tools that perform automatic wrapping
of legacy code [8].

Harness provides a flat mapping of Internet host names
onto computational resource names. However, the
framework allows each user to adopt his own
computational resource naming and grouping scheme as
long as he is able to provide a plug-in to map that scheme
onto the Internet host name space. Users wishing to share
the same name space need to use the same name-mapping
service. This mechanism allows the definition of abstract
user-defined computational resource naming policies
without compromising the coherency of the underlying
metacomputing framework.

The same mechanism has been adopted for plug-in
names. Thus, the framework provides a flat mapping of
plug-in names onto Java class names, but each
user can register his own grouping and naming policy by
providing his own mapper plug-in.
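To make the mapper idea concrete, the following sketch shows one possible shape for a resource-name mapper: a default flat mapping, and a user-supplied grouping that falls back to the flat mapping for unknown names. The interface ResourceNameMapper, the class NameMappers and the host names are all hypothetical; the paper does not give the real mapper interface.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical mapper contract: translate a user-level resource name into Internet host names. */
interface ResourceNameMapper {
    List<String> toHostNames(String resourceName);
}

final class NameMappers {
    private NameMappers() { }

    /** Default flat mapping: a resource name is simply an Internet host name. */
    static final ResourceNameMapper FLAT = name -> Collections.singletonList(name);

    /** A user-defined grouping scheme: group names expand to host lists, other names stay flat. */
    static ResourceNameMapper grouped(Map<String, List<String>> groups) {
        Map<String, List<String>> copy = new HashMap<>(groups);   // defensive copy
        return name -> copy.containsKey(name) ? copy.get(name) : FLAT.toHostNames(name);
    }
}
```

Two users who want to share a name space would, under this sketch, simply register the same mapper.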

3.2  Harness  implementation  and  protocols 

The  kernel  level  services  of  a  Harness  DVM  are 
delivered  by  a  distributed  system  composed  of  two 
categories  of  entities: 

•  a  DVM  status  server,  unique  for  each  DVM; 

•  a  set  of  Harness  kernels,  one  and  only  one  running  on 
each  computational  resource  currently  enrolled  or 
willing  to  be  enrolled  into  a  DVM. 

To  achieve  the  highest  possible  degree  of  portability  for 
the  kernel  level  services  both  the  kernel  and  the  DVM 
status  server  are  implemented  as  pure  Java  programs.  We 
have  used  the  multithreading  capability  of  the  Java  Virtual 
Machine  to  exploit  the  intrinsic  parallelism  of  the  different 
tasks  the  two  entities  have  to  perform,  and  we  have  built 
the  framework  as  a  set  of  Java  packages. 

Control messages and DVM status changes not related
to the discovery-and-join protocol or the recover-from-failure
protocol are exchanged through a star-shaped set
of reliable unicast channels whose center is the DVM
status server. These connections are implemented through
the communication facilities provided by the java.net
package. It is important to notice that neither the star
topology nor the use of the java.net package is a
constraint imposed on all the communication services in
the framework. On the contrary, user level communication
services may adopt the connection topology that best suits
their needs and are not required to use the java.net package.
For this reason, neither
the star topology interconnecting the kernels and the DVM
server nor the use of the java.net package
represents a major bottleneck in the Harness
metacomputing framework. The kernels and the DVM
server interact to guarantee a consistent evolution of the
status of the DVM both when users request new
services to be added and when computational
resources or networks fail. This consistency is enforced
by means of a set of protocols executed during the
different phases of the DVM life. In the following
subsections we describe each of them.

3.2.1  DVM  startup  protocol 

A  DVM  may  be  started  in  three  different  ways: 

•  starting  a  DVM  server; 

•  starting  a  kernel; 

•  starting  an  application. 

In  the  first  case,  a  user  invokes  the  execution  of  the 
main  method  of  the  Java  H_Server  class  from  the 
edu.emory.mathcs.harness  package  providing  as  a 
parameter  the  name  of  the  DVM  this  server  is  starting.  The 
DVM  server  executes  a  hashing  function  to  map  the  DVM 
name into a multicast IP address and port. Then it starts to
multicast I’m alive packets on the channel and to listen for
incoming packets. The DVM server can receive three types of
packets:

•  I’m  alive  packets  from  a  DVM  server; 

•  Join  packets  from  kernels; 

•  Query  packets  from  applications. 

The server checks the source address of any I’m alive
packet it receives. If the packet comes from another server,
the server multicasts a train of I’m alive packets to notify
the other server of its presence and then exits. This
forces the kernels running on computational resources
enrolled in the DVM to start the server regeneration
protocol and to regenerate a new, single server. This
mechanism prevents the existence of multiple DVM
servers with partial or outdated information and guarantees
that a single DVM server is active in a DVM.

If the server receives a Join packet, it opens a
TCP connection to the sending kernel and starts the Join
protocol.

If the server receives a Query packet, it checks whether a
kernel exists on the computational resource from which
the application is querying. If a kernel is already active,
the server provides the querying application with the
port number on which the kernel accepts connections from
applications; otherwise it provides a null reply.
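The paper does not specify the hashing function that maps a DVM name onto a multicast channel. Purely as an illustration, one deterministic mapping could look like the sketch below. The class name DvmChannel, the use of String.hashCode, the administratively scoped 239.0.0.0/8 address block and the dynamic port range 49152 to 65535 are all assumptions made for this example.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical sketch: derive a multicast group and port from a DVM name. */
final class DvmChannel {
    final InetAddress group;
    final int port;

    private DvmChannel(InetAddress group, int port) {
        this.group = group;
        this.port = port;
    }

    static DvmChannel forName(String dvmName) {
        int h = dvmName.hashCode();   // specified by the Java API, so identical on every JVM
        byte[] address = {
            (byte) 239,               // assumption: stay inside 239.0.0.0/8
            (byte) (h >>> 16),
            (byte) (h >>> 8),
            (byte) h
        };
        // Assumption: pick a port in the dynamic/private range 49152..65535.
        int port = 49152 + Math.floorMod(31 * h + dvmName.length(), 16384);
        try {
            return new DvmChannel(InetAddress.getByAddress(address), port);
        } catch (UnknownHostException e) {
            throw new AssertionError("a 4-byte address is always valid", e);
        }
    }
}
```

Since servers, kernels and applications all compute the same function from the same name, they meet on the same channel without any prior configuration; two different DVM names may still collide on one channel, which a real implementation would have to tolerate.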

The  second  way  to  start  a  Harness  DVM  is  to  invoke 
the  main  method  of  the  Main  class  in  the 
edu.emory.mathcs.harness  package  providing  as  a  startup 
parameter  the  name  of  the  DVM  the  kernel  wants  to  enroll 
into. The kernel executes the hash function to map the
DVM name into an IP multicast address and port and
sends a Join packet on that channel. The kernel
makes three tries before giving up. After three tries
have timed out without a DVM server opening a TCP
connection, the kernel assumes that no DVM server exists and
spawns a new JVM to start a new DVM server. Then it
starts sending the Join packet again.
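The retry logic of the kernel startup can be sketched as follows. Network I/O is abstracted behind two callbacks so that the control flow is visible on its own; the class name KernelStartup, the callbacks and the maxRounds bound are hypothetical (the paper does not say whether the kernel ever gives up for good).

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the kernel-side startup loop: three Join tries, then spawn a server. */
final class KernelStartup {
    static final int TRIES_PER_ROUND = 3;   // "three tries" as stated in the text

    private KernelStartup() { }

    /**
     * @param sendJoinAndWait sends one Join packet; returns true if a DVM server
     *                        opened a TCP connection before the timeout expired
     * @param spawnServer     starts a DVM server (in the paper: in a newly spawned JVM)
     * @param maxRounds       assumed upper bound on rounds, to keep the sketch finite
     * @return the total number of Join packets sent before the server connected
     */
    static int enroll(BooleanSupplier sendJoinAndWait, Runnable spawnServer, int maxRounds) {
        int sent = 0;
        for (int round = 0; round < maxRounds; round++) {
            for (int attempt = 0; attempt < TRIES_PER_ROUND; attempt++) {
                sent++;
                if (sendJoinAndWait.getAsBoolean()) {
                    return sent;              // the Join protocol proper starts here
                }
            }
            if (round < maxRounds - 1) {
                spawnServer.run();            // nobody answered: assume no server exists
            }
        }
        throw new IllegalStateException("no DVM server reachable after " + sent + " Join packets");
    }
}
```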

The third way to start a Harness DVM is to instantiate
the class H_core or H_RMIcore from the package
edu.emory.mathcs.harness in an application, providing the
DVM name as a parameter. The class constructor executes
the hashing function and sends a Query packet on the
multicast channel. If no answer comes back or if the
answer says that no kernel is active on the computational
resource, the constructor spawns a new JVM starting a
kernel and sets a flag to avoid starting another one even if
a further set of tries fails. The possibility of
two or more applications racing to spawn two or more
kernels on the same computational resource is prevented
by the Join protocol.

3.2.2  Join  and  leave  protocols 

The DVM server initiates the Join protocol each time it
receives a multicast Join packet. The Join packet contains
the IP address and port number on which the willing-to-join
kernel is accepting a TCP connection. The first step
of the Join protocol is the establishment of a TCP
connection between the DVM server and the joining
kernel. Then the DVM server waits for the kernel to
provide its baseline. At this point the server performs two
checks: the baseline check and the uniqueness check. The
baseline check verifies the compatibility of
the kernel with the current implementation of the DVM
server. The uniqueness check verifies that no
other kernel has already joined from the same
computational resource. If either of these
two checks fails, an error message is sent back, the protocol
terminates with a failure and the connection is closed. If
the kernel passes both checks, the DVM server
checks whether the kernel is joining back after a failure
(computational resource or network crash) or whether the
computational resource has never been enrolled in the
DVM before. If the computational resource is coming
back from a crash, the DVM server sends the kernel a
crash token message and a copy of its pre-crash status;
otherwise it sends a new token message. The next
step is to get the kernel's current status and to send
it the current status of the DVM.
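The server-side decisions of the Join protocol can be summarized in a small state sketch. It leaves out all networking and models the baseline check as "the kernel offers at least the required services", which is an assumption: the paper only says the check verifies compatibility with the DVM server. The names JoinChecks, Outcome and recordCrash are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the decisions a DVM server takes during the Join protocol. */
final class JoinChecks {
    enum Outcome { BASELINE_FAILURE, UNIQUENESS_FAILURE, CRASH_TOKEN, NEW_TOKEN }

    private final Set<String> requiredBaseline;
    private final Set<String> enrolledHosts = new HashSet<>();
    private final Map<String, Set<String>> preCrashStatus = new HashMap<>();

    JoinChecks(Set<String> requiredBaseline) {
        this.requiredBaseline = new HashSet<>(requiredBaseline);
    }

    /** Runs the baseline check, then the uniqueness check, then picks the token to send. */
    synchronized Outcome join(String host, Set<String> kernelBaseline) {
        if (!kernelBaseline.containsAll(requiredBaseline)) {
            return Outcome.BASELINE_FAILURE;      // error message, connection closed
        }
        if (enrolledHosts.contains(host)) {
            return Outcome.UNIQUENESS_FAILURE;    // a kernel already joined from this host
        }
        enrolledHosts.add(host);
        // A real server would also send the saved pre-crash status with the crash token.
        return preCrashStatus.remove(host) != null ? Outcome.CRASH_TOKEN : Outcome.NEW_TOKEN;
    }

    /** Remembers the last known status of a kernel that crashed while enrolled. */
    synchronized void recordCrash(String host, Set<String> lastKnownServices) {
        if (enrolledHosts.remove(host)) {
            preCrashStatus.put(host, new HashSet<>(lastKnownServices));
        }
    }
}
```

The uniqueness check is also what resolves the race mentioned earlier: if two applications on one host each spawn a kernel, only the first Join succeeds.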

At this point the Join protocol is successfully
completed: the DVM server generates a Join event that is
distributed as described in the next section, and the kernel is
now enrolled in the DVM.

The Leave protocol is much simpler than the Join
protocol. The Leave protocol is always started by a kernel.
A TCP connection between the kernel and the DVM server
is guaranteed to be active, because it is not possible to
start the Leave protocol before a successful completion of
the Join protocol. The kernel sends an explicit Leave message to
the DVM server and then closes the TCP connection. The
DVM server generates a Leave event that is distributed as
described in the next section.

3.2.3 Totally ordered distribution of DVM status
changes

The  status  of  the  DVM  consists  of  the  set  of 
computational  resources  currently  enrolled  in  the  DVM, 
the  set  of  services  available  on  each  enrolled 
computational  resource  as  well  as  the  DVM’s  baseline. 
We call the baseline of a DVM the minimum set of services a
computational resource must be able to deliver in order to
join the DVM. The dynamic nature of the framework
makes this status an evolving entity; the framework therefore
keeps it up to date and available for queries from any
application or service in the DVM. It is important to notice
that information about the applications currently using
services, or about the internal status of an application, is
not part of the DVM status, and losing track of it does not
in any way compromise the existence of the DVM itself. Any form
of application tracking and check-pointing, while highly
desirable for many applications, is a service in itself and
the framework does not need to incorporate it in its status.

The Harness metacomputing framework guarantees that
all the events that change the status of the DVM are
received by all the kernels enrolled in the DVM in the
same order. In the current implementation the Total Order
(TO) protocol uses the DVM server as
a central ordering entity and exploits the stream nature
of TCP connections to avoid subsequent losses of order.
Although very simple, a centralized implementation of the
TO protocol has in general two negative features:

•  the  central  entity  is  a  single  point  of  failure; 

•  the  central  entity  is  a  bottleneck. 

However, these two problems do not represent a major
flaw in the design and efficiency of our framework.
The single point of failure only means that, after a DVM
server crash, the framework cannot retrieve the
status of a previously crashed kernel, and the central
bottleneck does not affect application level
communication services. The status of a DVM as
defined in the Harness metacomputing framework consists
of the sum of the statuses of all enrolled kernels. Each event
that changes the status of the DVM changes the status of a
kernel in a way that is recorded by the kernel itself, with
the only exception being a kernel crash. Thus it
is not possible for an event, except for kernel crash events,
to get lost in a DVM server crash. On the contrary, in the
case of a DVM server crash it is possible to reconstruct
the current status of the DVM completely, simply by
obtaining from every surviving kernel a copy of its current
status.
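The reconstruction argument above amounts to merging the per-kernel statuses. A minimal sketch, assuming that a kernel's status is just its host name plus the set of services it has loaded (the types DvmStatusRebuild and KernelStatus are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: rebuild the DVM status from the statuses of the surviving kernels. */
final class DvmStatusRebuild {
    private DvmStatusRebuild() { }

    /** One kernel's own record of its status. */
    static final class KernelStatus {
        final String host;
        final Set<String> services;

        KernelStatus(String host, Set<String> services) {
            this.host = host;
            this.services = services;
        }
    }

    /** The DVM status is the collection of all enrolled kernels with their loaded services. */
    static Map<String, Set<String>> rebuild(List<KernelStatus> survivors) {
        Map<String, Set<String>> dvm = new TreeMap<>();
        for (KernelStatus kernel : survivors) {
            dvm.put(kernel.host, new TreeSet<>(kernel.services));
        }
        return dvm;
    }
}
```

Kernels that crashed before the server did simply do not appear in the result, which is exactly the loss of information the text identifies.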

It is important to notice that the fact that this
reconstruction process is not able to keep track of crashed
kernels does not mean that applications relying on services
delivered by the crashed kernels have no choice but
to stop and fail. Reliable distributed check-pointing
of application status and restart of failing services are
services themselves; thus their behavior in the event of
kernel crashes is not constrained by the DVM status, and
the reconstruction of the DVM status is not concerned
with them.

To evaluate the bottleneck represented by the star
topology we measured the performance of the
following experimental setup. A SparcStation 5 running
Solaris 5.6 hosted the DVM server, a Harness kernel and
some other common unrelated applications (Internet
browser, X server, etc.). Harness kernels were hosted by
other, heterogeneous machines connected to the DVM
server on a 10 megabit Ethernet network. The average time
required by the DVM server to process and distribute events
was 10 ms. We repeated the measurements with an
increasing number of enrolled kernels, but the system
showed only a negligible overhead (less than 10%) for up
to 20 kernels. Although 10 ms is not a negligible amount
of time, it is important to notice that it applies only to
events requiring DVM status changes:
traffic generated by user applications exchanging data
is not required to flow through the DVM server. The only
events that the DVM status server needs to process are:

•  a  kernel  joining  the  DVM; 

•  a  kernel  leaving  the  DVM; 

•  a  kernel  crash; 

•  the  addition  of  a  service  to  the  DVM. 

Thus  the  DVM  server  represents  only  a  marginal 
bottleneck  in  the  Harness  metacomputing  framework. 
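The centralized total-order scheme of this section can be sketched in a few lines: the server stamps each status-changing event with a sequence number and appends it to every kernel's channel while holding a single lock. In-memory lists stand in for the TCP connections, whose FIFO delivery is what preserves the order in the real system. The names StatusSequencer, Kind and Event are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: the DVM server as the central ordering entity of the TO protocol. */
final class StatusSequencer {
    /** The four kinds of events listed in the text. */
    enum Kind { KERNEL_JOIN, KERNEL_LEAVE, KERNEL_CRASH, SERVICE_ADDED }

    static final class Event {
        final long sequence;
        final Kind kind;
        final String subject;

        Event(long sequence, Kind kind, String subject) {
            this.sequence = sequence;
            this.kind = kind;
            this.subject = subject;
        }
    }

    private final List<List<Event>> kernelChannels = new ArrayList<>();
    private long nextSequence = 0;

    /** Registers a kernel; the returned list stands in for its FIFO TCP connection. */
    synchronized List<Event> connectKernel() {
        List<Event> channel = new ArrayList<>();
        kernelChannels.add(channel);
        return channel;
    }

    /** Orders one event and forwards it to every kernel under a single lock. */
    synchronized Event publish(Kind kind, String subject) {
        Event event = new Event(nextSequence++, kind, subject);
        for (List<Event> channel : kernelChannels) {
            channel.add(event);
        }
        return event;
    }
}
```

The lock makes the server a serialization point, which is the bottleneck the measurements above quantify; application data never passes through it.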

3.2.4  The  core  library 

Any computational resource enrolled in a Harness
DVM is able to provide from the start only a single
service: the capability to add services to the set of services
currently available by loading plug-ins in a distributed,
coordinated fashion. We call this capability the Basic
Loading Service. Any additional service can be plugged in
on demand if a plug-in able to deliver it is available. This
design allows the Harness system to be as open-ended as
possible: the only hard-wired service is
the basic loading service, and any other service may be
developed at a later stage and added on demand by the users
who require it.

The  requirement  that  each  plug-in  appears  to  the  system 
as  a  Java  class  allowed  us  to  adopt  the  Java  classes  name 
space  as  the  Harness  Plug-in  name  space.  This  name  space 
allows  users  to  generate  fully  package-qualified  names 
that are virtually collision-free. Even so, the Harness
metacomputing framework guarantees that name collisions
will not compromise the coherency of the programming
environment. This guarantee is provided by checking at
run time the uniqueness of any plug-in loaded in the
DVM.

The basic loading service is implemented through a
loading protocol and delivered to users by means of a core
library. The core library is the main service access point
for applications to a DVM. We have developed two
different versions of the core library: a pure Java,
object-oriented, Remote Method Invocation (RMI) based one,
and a generic, loosely typed, socket based one. The first
version of the core library takes full advantage of the
object-oriented features of the Java programming language
as well as of the Object Serialization mechanism and of
the RMI capabilities. The second version of the core
library has been designed to allow legacy-code-based
applications and, more generally, non-Java-based
applications to set up and adapt the environment by means
of the basic loading service. In this second version of the
library all the data exchanged between the application and
the kernel are reduced to strings of characters,
and the core library takes care of marshalling and
unmarshalling the parameters according to the requirements
of the language of the application. Currently, only the
object-oriented pure Java version of the core library has
been fully developed; however, a test implementation of
the generic version has been developed in Java to test and
debug the generic socket interface in the kernel.

The  core  library  provides  access  to  the  only  fixed 
service  access  point  of  the  Harness  system,  namely  the 
functions: 

• public H_RMIcore(String DVMName);

• public void HC_RegisterUser(String username,
String password, H_pname serviceMapper,
H_pname RCmapper);

• public H_RetVal HC_Load(H_pname theServiceName,
H_crname[] theComputationalResourcesNames,
H_QoS theQoS);

• public H_RetVal HC_GetInterfaceDescriptor(H_Handle
pluginHandle, H_CR computationalResource);

• public H_Info HC_GetInfo();

Figure 3 Three phases commit scheme

The constructor requires a DVM name and performs
the actions and checks described in the section dedicated
to the DVM startup sequence. The HC_RegisterUser
function requires a username, a password, the name of a
service to plug-in mapper plug-in and the name of a
computational resources to Internet hosts mapper plug-in.
If either mapper is null the system will use the default
mapping for any other command; otherwise the designated
mapper is loaded and registered for the user. User names
are associated with sets of capabilities. These sets of
capabilities are local to each computational resource, thus
each user can have different privileges granted on each
computational resource enrolled in the DVM. By default a
user name is associated with the set of capabilities of user
nobody, unless the user name is equal to that of the owner
of the local Harness kernel. In the latter case it is
associated with the set of capabilities of user root. The DVM
stores the login name and password pairs and keeps them for
the whole life of the DVM. Thus a user can detach from the
DVM and come back at a later time: as long as the user is
able to provide the password the DVM will grant the
corresponding set of capabilities.

The  HC_Load  function  allows  a  user  to  add  the  named 
service  on  the  named  computational  resources.  If  the  user 
issuing  the  command  has  registered  mappers  for  service 
and  computational  resources  names  those  mappers  will  be 
used,  otherwise  the  flat  mapping  will  be  performed.  The 
operation  is  performed  with  the  specified  Quality  of 
Service  (QoS).  The  Harness  computational  framework 
supports  four  different  QoSs  that  are  generated  as  the 
combination  of  two  parameters: 

•  all  or  none  vs.  at  least  one; 

•  two  phases  commit  vs.  completed  execution 
guaranteed  (three  phases  commit); 

A  load  command  issued  with  an  all  or  none  QoS  will 
succeed  if  and  only  if  all  the  required  computational 
resources  are  able  and  willing  to  perform  the  operation.  A 
load  command  issued  with  an  at  least  one  QoS  will  fail  if 
and  only  if  all  the  required  computational  resources  are  not 
able  or  not  willing  to  perform  the  operation. 

A load command issued with the two phases commit
QoS will be performed according to the scheme described
in figure 2. This QoS does not guarantee that all the
kernels have performed the command at the time the call
returns; it merely guarantees that DVM status changing
actions are not taken until they have been confirmed.

A  load  command  issued  with  the  completed  execution 
guaranteed  QoS  will  be  performed  according  to  the 
scheme  shown  in  figure  3.  This  QoS  guarantees  that  at  the 
time  a  user  command  returns  all  the  kernels  have 
performed  the  requested  action.  A  kernel  failing  to  reply 
with  the  done  message  in  reasonable  time  will  cause  the 
originating  kernel  to  time-out  and  automatically  generate  a 
done  message. 

In  both  cases,  time-outs  guarantee  that  a  user  will  not 
wait  indefinitely  for  a  command  to  complete. 
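The success rule for the first QoS parameter can be sketched as follows. This is a minimal illustration and not the Harness implementation: the class `LoadOutcome`, the enum `Scope` and the method `succeeds` are hypothetical names, and each required computational resource is reduced to a single "able and willing" flag.

```java
import java.util.Arrays;
import java.util.List;

public class LoadOutcome {
    /** The first QoS parameter named in the text (hypothetical enum). */
    enum Scope { ALL_OR_NONE, AT_LEAST_ONE }

    /**
     * Decides whether a load command succeeds, given one flag per required
     * computational resource (true = able and willing to perform the load).
     */
    static boolean succeeds(Scope scope, List<Boolean> ableAndWilling) {
        if (scope == Scope.ALL_OR_NONE) {
            // Succeeds if and only if every resource accepts.
            return ableAndWilling.stream().allMatch(b -> b);
        }
        // AT_LEAST_ONE: fails if and only if every resource refuses.
        return ableAndWilling.stream().anyMatch(b -> b);
    }

    public static void main(String[] args) {
        List<Boolean> replies = Arrays.asList(true, false, true);
        System.out.println(succeeds(Scope.ALL_OR_NONE, replies));  // prints false
        System.out.println(succeeds(Scope.AT_LEAST_ONE, replies)); // prints true
    }
}
```

The second parameter (two phases vs. three phases commit) changes when the call returns, not whether it succeeds, so it is left out of this sketch.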

3.2.5  The  Harness  classloader  and  the  guaranteed 
uniqueness  of  loaded  classes 

In order to be able to retrieve classes from network
repositories and from other computational resources
enrolled in a DVM we developed a special classloader.
This classloader looks for a class in several steps,
enlarging the scope of the search at each step. These steps
are:

•  class  loader  cache; 

•  local  file  system; 

•  local  file  systems  of  any  computational  resource  in  the 
DVM; 

•  the  whole  set  of  repositories. 


If,  after  these  steps,  the  class  loader  has  not  found  the 
required  class  then  the  loading  process  fails. 
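The widening search can be pictured as an ordered chain of lookups that stops at the first hit. The sketch below is only an illustration under stated assumptions: `SearchChain`, `addScope` and `locate` are hypothetical names, and each scope is modelled as an in-memory map rather than a real cache, file system or network repository.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

public class SearchChain {
    /** Ordered scopes; each lookup returns the class bytes, or null if absent. */
    private final Map<String, Function<String, byte[]>> scopes = new LinkedHashMap<>();

    void addScope(String label, Function<String, byte[]> lookup) {
        scopes.put(label, lookup);
    }

    /** Returns the label of the first scope that holds the class, if any. */
    Optional<String> locate(String className) {
        for (Map.Entry<String, Function<String, byte[]>> scope : scopes.entrySet()) {
            if (scope.getValue().apply(className) != null) {
                return Optional.of(scope.getKey());
            }
        }
        return Optional.empty(); // every step failed: the loading process fails
    }

    public static void main(String[] args) {
        Map<String, byte[]> cache = new HashMap<>();
        Map<String, byte[]> localFiles = new HashMap<>();
        Map<String, byte[]> otherResources = new HashMap<>();
        Map<String, byte[]> repositories = new HashMap<>();
        otherResources.put("MyPlugin", new byte[] {1, 2, 3});

        SearchChain chain = new SearchChain();
        chain.addScope("class loader cache", cache::get);
        chain.addScope("local file system", localFiles::get);
        chain.addScope("other computational resources", otherResources::get);
        chain.addScope("repositories", repositories::get);

        System.out.println(chain.locate("MyPlugin").orElse("not found"));
        System.out.println(chain.locate("Missing").orElse("not found"));
    }
}
```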

Although the process of loading and checking new classes
is cumbersome and requires distributed processing, the
framework executes it only when a class is referenced for
the first time on a computational resource. Thus it does
not represent a bottleneck for computation but merely a
startup overhead.

The  HC_GetInterfaceDescriptor  function  returns  an 
instance  of  a  class  containing  all  the  data  necessary  to  an 
application  to  start  using  the  service  whose  handle  and 
providing  computational  resource  have  been  passed  as 
parameters. 

The HC_GetInfo function returns an instance of the
H_Info class. This class can be manipulated to get all the
information about the status of the DVM. It is important to
notice that an instance of the H_Info class is not an active
entity: it is not automatically updated and it only reflects
the status of the DVM at the moment it was created. On
the other hand, the framework provides as a base service the
capability to request automatic notification of DVM status
changes as events, and these events can be applied to the
H_Info object to keep it up to date.

Each class that is loaded into a DVM, with the sole
exception of the classes belonging to the standard Java
distribution, is checked for uniqueness over the whole
DVM. This check is performed using the DVM status
server as the ordering authority of each class loading
operation. Each time a class loader tries to load a class it
sends the CRC32 of the byte array representing the class
to the DVM status server. If the DVM server has stored a
different value for the CRC32 of the named class, the class
loader will receive a deny message and the loading process
will fail. If the class is not yet present in the DVM, or if
the CRC32 of the class that is being loaded is equal to the
one registered by the DVM server, the class loader is allowed
to load the class and store its CRC in a local table. The
DVM status server ensures that no races can occur, since
it serializes all the checks. This centralized table does not
represent a single point of failure because every
computational resource stores locally every verified CRC.
Thus, in case of a DVM status server crash, it is possible
to reconstruct the table of the CRCs of all the classes
loaded in the whole DVM by computing the union of the
tables of all the computational resources enrolled in the
DVM.
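The uniqueness check can be sketched with the standard `java.util.zip.CRC32` class. This is an illustrative model, not the Harness code: `CrcRegistry`, `crcOf`, `check` and `rebuild` are hypothetical names, the status server is reduced to an in-process table, and `synchronized` stands in for the serialization of checks performed by the server.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

public class CrcRegistry {
    /** Class name -> CRC32 of the first accepted byte array for that name. */
    private final Map<String, Long> registered = new HashMap<>();

    /** Computes the CRC32 of the byte array representing a class. */
    static long crcOf(byte[] classBytes) {
        CRC32 crc = new CRC32();
        crc.update(classBytes);
        return crc.getValue();
    }

    /**
     * Allows the load if the name is new or the CRC matches the stored one;
     * denies it otherwise. Checks are serialized by the lock on this object.
     */
    synchronized boolean check(String className, long crc) {
        Long known = registered.putIfAbsent(className, crc);
        return known == null || known == crc;
    }

    /** Rebuilds the central table as the union of the per-resource tables. */
    static CrcRegistry rebuild(List<Map<String, Long>> localTables) {
        CrcRegistry registry = new CrcRegistry();
        for (Map<String, Long> table : localTables) {
            registry.registered.putAll(table);
        }
        return registry;
    }

    public static void main(String[] args) {
        CrcRegistry server = new CrcRegistry();
        long v1 = crcOf(new byte[] {1, 2, 3});
        long v2 = crcOf(new byte[] {1, 2, 4});
        System.out.println(server.check("MyPlugin", v1)); // prints true (first load)
        System.out.println(server.check("MyPlugin", v1)); // prints true (same bytes)
        System.out.println(server.check("MyPlugin", v2)); // prints false (denied)
    }
}
```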

3.2.6  Reconstruction  of  the  DVM  status  after  a 
DVM  status  server  crash:  the  regeneration 
protocol 

The DVM status server issues periodic “I'm alive”
messages on the DVM multicast channel. Each kernel
checks the timely arrival of these messages to be sure the
server is up and running. If a kernel misses two
consecutive “I'm alive” messages it tries to ping the server
on the reliable TCP connection; if there is no answer it
starts the regeneration procedure. During the regeneration
procedure applications can continue to use available
services. However, no change to the DVM status is
possible and the services cannot load new classes.

Figure 4 Snapshot of the Harness Distributed Virtual Machine Status Display application. The display shows every
computational resource enrolled in the DVM, all the plug-ins currently loaded and the services each plug-in
provides.

Each kernel starts the regeneration procedure by
sending packets on the DVM multicast channel to notify
everyone else of its presence. Each kernel compares the list
of received packets with the list of computational
resources enrolled in the DVM at the time the crash
occurred. After three rounds of packet exchanges every
kernel selects the new candidate for the DVM status server
according to a metric allowing no ties; the currently
adopted metric is simply the lowest Internet address. The
chosen candidate spawns a new DVM status server and
every kernel rejoins the DVM. When every surviving
kernel has rejoined the DVM the reconstruction of the
status is complete, with the only possible loss being the
status of crashed kernels.
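The candidate selection step can be sketched as follows. The text only states that the metric is the lowest Internet address; the sketch assumes this means comparing the raw address bytes as unsigned values, most significant byte first. `Election`, `chooseCandidate` and `ip` are hypothetical names.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class Election {
    /** Orders addresses by their raw bytes, treated as unsigned values. */
    static final Comparator<InetAddress> BY_ADDRESS = (a, b) -> {
        byte[] x = a.getAddress();
        byte[] y = b.getAddress();
        if (x.length != y.length) {
            return Integer.compare(x.length, y.length); // IPv4 before IPv6
        }
        for (int i = 0; i < x.length; i++) {
            int c = Integer.compare(x[i] & 0xff, y[i] & 0xff);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    };

    /** Every kernel applies the same rule to the same list, so all agree. */
    static InetAddress chooseCandidate(List<InetAddress> survivors) {
        return survivors.stream().min(BY_ADDRESS)
                .orElseThrow(() -> new IllegalArgumentException("no surviving kernel"));
    }

    /** Builds an IPv4 address from literal octets without any DNS lookup. */
    static InetAddress ip(int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(
                    new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        List<InetAddress> survivors = Arrays.asList(
                ip(192, 168, 1, 20), ip(10, 0, 0, 200), ip(10, 0, 0, 7));
        System.out.println(chooseCandidate(survivors).getHostAddress()); // prints 10.0.0.7
    }
}
```

Because distinct hosts have distinct addresses, the ordering admits no ties, which is the property the text requires of the metric.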

4  Example  service,  application  and 
programming  constraints 

The  single  inheritance  constraint  imposed  by  the  Java 
language  makes  the  use  of  abstract  classes  extremely 
restrictive.  As  a  matter  of  fact  this  constraint  prevents 
users  from  taking  advantage  of  capabilities  provided  by 
classes  if  those  classes  were  not  taken  into  account  at  the 
time  the  abstract  class  was  defined.  For  this  reason  in  the 
Harness  metacomputing  framework  we  have  adopted  Java 
Interfaces  as  the  mechanism  of  choice  to  define  the  way  in 
which  applications  can  access  services. 

The execution of an application in the Harness
metacomputing framework can be divided into two separate
phases: environment set-up and actual execution. The
set-up phase can be further divided into three sub-phases:
compatibility check, service loading and connection to
service access points. In the compatibility check sub-phase
an application checks that the DVM is able to
deliver the specialized services it needs. In the service
loading sub-phase the application requests the loading of
the plug-ins that implement the services it needs. In this
sub-phase the application can specify plug-in names, thus
posing constraints on the actual implementation of the
services, or it can query the system for default plug-ins
for the needed services. The default implementation of the
service access point connection sub-phase consists of
requesting from the DVM references to remote Java objects
that will be accessed through Java RMI. However, Java
RMI is not the only service access mechanism adopted in
the Harness metacomputing framework. Legacy
applications can access legacy enabled Harness
services through standard Internet sockets. This
mechanism allows the porting of legacy applications to the
Harness system by implementing the necessary plug-ins
and the code for the set-up phase.

Once  the  application  has  completed  the  environment 
setup  it  can  start  the  actual  execution  phase.  In  this  phase, 
the  application  accesses  the  services  provided  by  the  DVM 
through  the  service  access  points  set  up  in  the  previous 
phase,  thus  it  is  completely  independent  from  the  service 
implementation. 

As a first example of a Harness service we adopt the
synchronous DVM event notification service. This service
consists of the capability for any application or other
service to register as interested in DVM events and request
to be notified of their occurrence. DVM events include the
joining and leaving (or crashing) of any computational
resource as well as the loading of any service. The
synchronous DVM event notification service does not
allow users to register callbacks; it requires users to
actively request the next available event with a blocking
call. A callback capable version of the service targeted at
single-threaded users can be easily layered on top of this
service and implemented using the multi-threaded
architecture of the JVM.

 1.  public class Display2
 2.  {
 3.    static public void main(String[] argv)
 4.    {
 5.      h = new H_RMIcore(argv[0]);
 6.      h.login(argv[1], argv[2]);
 7.      HDVMDisplay theD = new HDVMDisplay();
 8.      try
 9.      {
10.        H_crname[] targets = new H_crname[1];
11.        targets[0] = new H_crname(InetAddress.getLocalHost().getHostName());
12.        retval = h.load(new H_pname("edu.emory.mathcs.harness.H_SNotifierImpl"),
                         targets, new H_QoS("ALL"));
13.        H_Notifier theN = (H_Notifier) retval.references[0];
14.        myID = theN.H_register();
15.        H_Info theinfo = h.getInfo();
16.        theD.setCurrent(theinfo);
17.        while (true)
18.        {
19.          theD.update(theN.getNext());
20.        }
21.      }
22.      catch (Exception e)
23.      {
24.        System.err.println(e);
25.      }
26.    }
27.  }

Figure 5 Java code for the main class of the Harness Distributed Virtual Machine Status Display application.

The  Harness  metacomputing  framework  defines  two 
Java  interfaces  related  to  the  synchronous  event 
notification  service:  the  H_Notifier  interface  and  the 
H_INotifier interface. The first one defines the set of
functions  users  can  use  to  interact  with  the  service.  The 
second  Java  interface  is  the  one  that  a  plug-in  needs  to 
implement  in  order  to  have  the  system  dispatch  the  events 
to  it.  The  plug-in  guarantees  that  all  the  DVM  events 
occurring  in  the  DVM  after  the  user  registration  will  be 
queued  exactly  in  the  order  in  which  they  occurred  and 
delivered  in  that  order  on  demand. 

We have used this service to develop a simple DVM
status display application. Figure 4 shows a
snapshot of the display offered by this application, while
figure 5 shows part of the application code. Lines 5 to 13
represent the set-up phase, while the actual execution
phase consists of lines 14 to 20.

Like any Harness application, the display first needs to
instantiate a new Harness core library in order to connect to
a DVM (see line 5). Then it logs into the DVM
providing a user name and a password (see line 6). In line
7 the application generates an instance of a graphical
status display class. This application needs no specialized
services, thus there is no compatibility check sub-phase. In
lines 10 to 12 the application executes the service loading
sub-phase, by loading a plug-in implementing the notifier
service into the DVM. The service access point
connection sub-phase consists of line 13 alone, where the
application retrieves a reference to the service and stores it
into a variable.

In  line  14  the  application  registers  itself  as  a  recipient 
for  system  events,  then  in  line  15  gets  the  current  status  of 
the  system  and  in  line  16  sets  it  into  the  display.  Lines  17- 
20  loop  endlessly  to  get  new  events  and  update  the  display 
accordingly. 

This  simple  application  clearly  shows  the  main 
characteristics  of  Harness  applications: 

•  the  independence  of  the  application  code  from  the 
service  implementation; 

•  the  clear  separation  of  set-up  phase  from  execution 
phase. 

5  Related  works 

Metacomputing frameworks have been popular for
nearly a decade, since the advent of high end workstations
and ubiquitous networking in the late 80's enabled high
performance concurrent computing in networked
environments. PVM [9] was one of the earliest systems to
formulate the metacomputing concept in concrete virtual
machine and programming-environment terms, and to
explore heterogeneous network computing. PVM is based
on  the  notion  of  a  dynamic,  user-specified  host  pool,  over 
which  software  emulates  a  generalized  concurrent 
computing  resource.  Dynamic  process  management 
coupled  with  strongly  typed  heterogeneous  message 
passing  in  PVM  provides  an  effective  environment  for 
distributed  memory  parallel  programs.  PVM  however,  is 
inflexible  in  many  respects  that  can  be  constraining  to  the 
next  generation  of  metacomputing  and  collaborative 
applications.  For  example,  multiple  DVM  merging  and 
splitting  is  not  supported.  Two  different  users  cannot 
interact,  cooperate,  and  share  resources  and  programs 
within  a  live  PVM  machine.  PVM  uses  Internet  protocols 
which  may  preclude  the  use  of  specialized  network 
hardware.  The  Harness  “plug-in”  paradigm  effectively 
alleviates  these  drawbacks  while  providing  greatly 
expanded  scope  and  substantial  protection  against  both 
rigidity  and  obsolescence. 

Legion  [10]  is  a  metacomputing  system  that  began  as 
an  extension  of  the  Mentat  project.  Legion  can 
accommodate  a  heterogeneous  mix  of  geographically 
distributed  high-performance  machines  and  workstations. 
Legion  is  an  object  oriented  system  where  the  focus  is  on 
providing  transparent  access  to  an  enterprise-wide 
distributed  computing  framework.  As  such,  it  does  not 
attempt  to  cater  to  changing  needs  and  it  is  relatively  static 
in  the  types  of  computing  models  it  supports  as  well  as  in 
implementation. 

Globus  [11]  is  a  metacomputing  infrastructure  which  is 
built  upon  the  “Nexus”  [12]  multi-language 
communication  framework.  The  Globus  system  is 
designed  around  the  concept  of  a  toolkit  that  consists  of 
the  pre-defined  modules  pertaining  to  communication, 
resource  allocation,  data,  etc.  However  the  assembly  of 
these  modules  is  not  supposed  to  happen  dynamically  at 
run-time  as  in  Harness.  Besides,  the  modularity  of  Globus 
remains  at  the  metacomputing  system  level  in  the  sense 
that  modules  affect  the  global  composition  of  the 
metacomputing  substrate. 

The Sun Microsystems Jini project [13] presents a model
where a federation of Java enabled objects connected
through a network can freely interact and deliver services
to each other and to end users. In principle Jini shares with
Harness many keywords and goals, such as services as
building blocks and heterogeneity of service providers.
However, Jini focuses on the capability to build up a world
of plug-and-play consumer devices and, to meet this
goal, increases the resolution of the computational
resources that can be enrolled in a Jini federation. In fact
in the Jini model these resources range from complete
computational systems down to devices such as disks,
printers, TVs and VCRs.

The CORBA Object Management Architecture [14]
provides a model to which Object Request Brokers of
different vendors can refer in order to interact seamlessly;
this generality is achieved by defining protocols and
interfaces to be used by Object Request Brokers. However
the focus of CORBA is upon interoperability of service
providers while the focus of our project is mainly on
building, managing and dynamically reconfiguring such
service providers. Besides, a large part of the CORBA
system is dedicated to overcoming the problems of
a multi-language system, while Harness completely
delegates these issues to Java and the JNI.

The  CONDOR  [15]  project  has  demonstrated  the 
usefulness  of  a  flexible  approach  to  the  problem  of 
resource  gathering.  However,  the  CONDOR  project  does 
not  envision  dynamic  reconfigurability  of  the  set  of 
services  provided  by  the  computational  resources. 

Almost  all  the  above  projects  envision  a  model  in 
which  very  high  performance  bricks  are  statically 
connected to build a larger system. One of the main ideas
of  the  Harness  project  is  to  trade  some  efficiency  to  gain 
enhanced  global  availability,  upgradability  and  resilience 
to  failures  by  dynamically  connecting,  disconnecting  and 
reconfiguring  heterogeneous  components.  Harness  is  also 
seen  as  a  research  tool  for  exploring  pluggability  and 
dynamic  adaptability  within  DVMs. 

6  Concluding  remarks 

In  this  paper  we  have  described  the  Harness 
metacomputing  framework.  The  main  feature  of  this 
system  is  its  capability  to  allow  users  to  build,  reconfigure, 
use  and  dismantle  Distributed  Virtual  Machines  (DVM). 
Harness  dynamically  enrolls  and  reconfigures 
heterogeneous  computational  resources  into  DVMs  in 
order  to  provide  a  consistent  service  baseline  to  the  users. 
This  fundamental  characteristic  of  Harness  is  intended  to 
address  the  inflexibility  of  current  metacomputing 
frameworks as well as their inability to incorporate
new  technologies  and  avoid  rapid  obsolescence.  These 
results  are  achieved  without  compromising  the  coherency 
of  the  programming  environment  by  means  of  a 
distributed  plug-in  mechanism,  a  high  level  of  code 
portability  and  a  fault-tolerant  dynamic  status  management 
mechanism. 

In  this  paper  we  have  shown  that  the  current  prototype 
of  Harness  is  able: 

• to define services in terms of abstract Java interfaces in
order to have them updated in a user transparent
manner;

• to adapt to changing user needs by adding new
services to heterogeneous computational resources via
the plug-in mechanism;

• to cope with network and host failures with a limited
amount of overhead.

Although the model for security and access control has
been defined and briefly described in this paper, the
current prototype does not implement it. This feature will
be included in future releases.
Abstract

A parallel programming system, called MPC++, provides
parallel primitives such as a remote function invocation,
a global pointer, and a synchronization structure using
the C++ template feature. The system has run on a cluster
of homogeneous computers. In this paper, the runtime
system is extended to run on a cluster made up of
heterogeneous computers. Unlike other distributed
or parallel programming systems on heterogeneous computers,
with this extension the same program that runs on the
homogeneous environment also runs on the heterogeneous
environment.


1.  Introduction 

Most parallel programming language systems support
not only communication between computers but also
remote function invocation and other parallel primitives.
If such a system is designed to run on homogeneous
as well as heterogeneous computer environments, some
issues, e.g., data conversion and the address of a remote
function, must be solved. There are two approaches. One
is to provide a parallel programming library. Whenever a
remote function invocation is used, libraries for data
conversion and marshalling arguments are called, followed by
a call to a remote function invocation library. Since such
libraries are called at every remote function invocation,
regardless of whether or not the receiver is the same machine
type, this approach incurs an overhead when the sender and
receiver are the same machine type.

The other one is to design a new parallel programming
language. This supports good programming facilities, but
requires the building of a compiler to generate efficient
code for both the homogeneous and heterogeneous
environments. Moreover, the users need to learn the new
language, which is usually not accepted.

MPC++ provides parallel primitives such as remote
function invocation, a global pointer, and a synchronization
structure using the C++ template feature without any
extensions to C++. The system assumes an SPMD programming
model where the same program runs on all computers but
uses different data. The runtime system is extended to run on
the heterogeneous computers environment without changing
any existing MPC++ library specifications. The new MPC++
runtime system overcomes the disadvantages of the library
approach to supporting both the homogeneous and
heterogeneous environments.

First, an overview of MPC++ is given in section 2.
Section 3 briefly discusses issues in the heterogeneous
environment. The design and implementation in the
heterogeneous environment are presented in
section 4. Section 5 presents preliminary evaluation results
using  two  Sun  Sparc  Station  20s,  one  Intel  Pentium,  and 
one  Compaq  Alpha.  Related  work  is  described  in  section  6. 
Finally,  we  conclude  this  paper  in  section  7. 

2.  MPC++ 

The  MPC++  program  is  assumed  to  run  on  a  distributed 
memory-based  parallel  system.  Program  code  is  distributed 
to  all  physical  processors  and  a  process  for  the  program  runs 
on  each  processor. 

To support parallel description primitives in MPC++
Version 2.0, the MPC++ multiple threads template library
(MTTL for short), realized by C++ templates, has
been designed and implemented [5]. It contains i) invoke
and ainvoke function templates for synchronous
and asynchronous local/remote thread invocation, ii) the Sync
class template for synchronization and communication
among threads, iii) the GlobalPtr class template for pointers
to remote memory, and iv) the yield function to suspend
thread execution and yield to another thread. In this
paper, the invocation and global pointer mechanisms are
described.

2.1.  invoke/ainvoke 

The invoke function template allows us to invoke a remote
function, which involves creation of a new thread on
the remote processor.

0-7695-0107-9/99 $10.00 © 1999 IEEE

The  invoke  function  template  has  two  formats,  one  for 
a  function  returning  a  value  and  one  for  a  void  function. 
The  former  invoke  format  takes  i)  a  variable  where  the 
return  value  is  stored,  ii)  the  processor  number  on  which  a 
function is invoked, iii) a function name, and iv) its arguments.
The latter invoke takes i) the processor number on
which  a  function  is  invoked,  ii)  a  function  name,  and  iii)  its 
arguments. 

The following example shows that a foo function is invoked
on processor 1. The execution of the mpc_main
thread is blocked until foo function execution is terminated.
After the end of foo function execution, the return
value is stored in variable i and then mpc_main thread
execution is resumed. A void function is invoked on processor
2 in line 9. After the execution of the bar function is
finished, mpc_main thread execution is resumed.

 1  #include <mpcxx.h>
 2  int foo(int, int);
 3  void bar(int, int);
 4  mpc_main()
 5  {
 6      int i;
 7
 8      invoke(i, 1, foo, 1, 2);
 9      invoke(2, bar, 10, 20);
10  }

The ainvoke function template is provided to program
asynchronous remote/local function invocation.
Asynchronous remote function invocation means that a thread
invokes a remote function and executes the subsequent
program without blocking. The thread may get the return value
later using the synchronization structure.


2.2.  GlobalPtr 


Any local object can be referred to using a global
pointer, which is realized by the GlobalPtr class template.
The GlobalPtr class template takes one type parameter
which represents the type of the storage pointed to
by the global pointer. The operations on a GlobalPtr
object are almost the same as those on a regular pointer
object, except that a global pointer to a global pointer is
not allowed.

A simple example is shown below. The foo function
takes a global pointer and a value as the parameters and
saves the value into the storage pointed to by the global
pointer. The remote foo function is invoked in line 16.
A global pointer gp is initialized in line 15, where the
address of gl on processor #0 is set. In line 17, the global
pointer gp is made to point to the address of gl on
processor #2 using the set method of the GlobalPtr object.


 1  #include <mpcxx.h>
 2  int gl;
 3  void foo(GlobalPtr<int> gp, int val)
 4  {
 5      *gp = val;
 6  }
 7  void bar(GlobalPtr<int> gp)
 8  {
 9      printf("[Processor %d] *gp = %d\n",
10             myNode, (int) *gp);
11  }
12  mpc_main(int, char**)
13  {
14      GlobalPtr<int> gp;
15      gp = &gl;
16      invoke(1, foo, gp, 10);
17      gp.set(&gl, 2);
18      *gp = 20;
19      printf("[Processor %d] gl is %d\n",
20             myNode, gl);
21      invoke(2, bar, gp);
22  }

When the example is executed, you see the following
message:

[Processor 0] gl is 10
[Processor 1] *gp = 20

Note that the GlobalPtr has the nwrite and nread
functions to write/read data to/from a remote node.

3.  Issues 

Issues in the heterogeneous computers environment are
briefly reviewed below.

1. Data Type and Representation

Data types and their representations differ across
heterogeneous computers, e.g., little endian vs.
big endian. A long value occupies 64 bits or 32 bits
depending on the processor type.

2. Function Address and Global Scope Data Address

MPC++ assumes the SPMD programming style
where the same program runs on each computer. This
means that a function or a global scope datum has
the same address on every computer in
the homogeneous environment. Using this feature,
MPC++ realizes lightweight function invocation
and global pointer mechanisms. In the heterogeneous
environment, on the other hand, though the same
program is compiled on all machines, the addresses
of a function and of global scope data differ
between computers.

3. Global Pointer

A global pointer enables access to a remote memory
location across different processor types. If a pointer
points to a data structure, the representation of that
structure is different between computers.
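The first issue can be made concrete with a short sketch. It is written in Java purely for illustration (MPC++ itself is C++), and `ByteOrderDemo`, `encode` and `decode` are hypothetical names: the same four bytes yield different integers when the sender and the receiver assume different byte orders.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderDemo {
    /** Writes a 32-bit integer using the sender's byte order. */
    static byte[] encode(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    /** Reads a 32-bit integer using the receiver's byte order. */
    static int decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getInt();
    }

    public static void main(String[] args) {
        // A little endian sender stores 1 as the bytes 01 00 00 00.
        byte[] wire = encode(1, ByteOrder.LITTLE_ENDIAN);

        // A receiver with the same byte order recovers the value.
        System.out.println(decode(wire, ByteOrder.LITTLE_ENDIAN)); // prints 1

        // A big endian receiver reads 0x01000000 unless the data is converted.
        System.out.println(decode(wire, ByteOrder.BIG_ENDIAN));    // prints 16777216
    }
}
```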
4. Design and Implementation

4.1. Remote Function

A remote function invocation mechanism in a distributed memory-based parallel computer is usually realized as follows:

1. The Sender:
An invocation function performs the following procedure:

• A function address and its arguments are packed into a request message, known as marshalling.

• The request message is sent to the receiver.

• The reply message is received and the result is stored.

2. The Receiver:
The receiver has a dispatcher which performs the following procedure:

• A request message is unpacked, known as unmarshalling. To unmarshal the message, the dispatcher needs to know the data types of the arguments and of the function's return value.

• The function whose address is given in the message is invoked.

• The return value is sent back to the sender.

The dispatcher on the receiver cannot be implemented straightforwardly. In the homogeneous environment, the dispatcher could receive only the sizes of the arguments and the return value and create a call frame without knowing the data types. However, such an implementation usually requires assembler code, and thus the code is not portable.

Our approach is to create a dispatcher for each request message type so that the dispatcher knows all data types. In the rest of this section, class and function templates are first introduced to realize a remote function invocation mechanism in the homogeneous environment. Then these templates are modified to adapt to the heterogeneous environment.

4.1.1. Homogeneous Computers Environment

First of all, a request message data structure is defined using the C++ template feature. The following example defines a request message for a two-argument function, containing the function address and its arguments:

template<class F, class A1, class A2>
struct req_message2 {
    F  (*func)(A1, A2);
    A1 a1;
    A2 a2;
};


template<class F, class A1, class A2>
class _dispatcher2 {
public:
    static void invoker() {
        req_message2<F, A1, A2> *argp;
        F val;
        argp = (req_message2<F, A1, A2> *)
            _getArgPrimitive();
        val = argp->func(argp->a1, argp->a2);
        _remoteReturnPrimitive(&val,
                               sizeof(val));
    }
};

Figure 1. A dispatcher in the homogeneous environment


template<class F, class A1, class A2>
inline void
invoke(F &ret, int pe, F (*f)(A1, A2),
       A1 a1, A2 a2) {
    req_message2<F, A1, A2> arg;
    arg.func = f;
    arg.a1 = a1;
    arg.a2 = a2;
    remoteSyncInvokePrimitive(pe,
        _dispatcher2<F, A1, A2>::invoker,
        &arg, sizeof(arg), &ret);
}

Figure 2. invoke template function in the homogeneous environment


The dispatcher for the above request message on the receiver is defined using the C++ template as shown in Figure 1. In this example, a message is obtained by the _getArgPrimitive runtime routine, and then a function is invoked using the function address specified by the message. The return value is sent to the sender by calling _remoteReturnPrimitive.

An invoke function for a two-argument function is defined in Figure 2. A message is constructed using the req_message2 template, and then remoteSyncInvokePrimitive is invoked to send the message to the receiver specified by pe. When the receiver receives the message, the instantiated invoker function of class _dispatcher2<F, A1, A2> is invoked.

It should be noted that the templates req_message2 and _dispatcher2 are instantiated when the invoke template is instantiated. An example of the invoke template function usage is shown below and its compiled code is shown in Figure 3.
1  extern int foo(int, double);
2  mpc_main()
3  {
4      int ret;
5      req_message2<int, double> arg;
6      arg.func = foo;
7      arg.a1 = 5;
8      arg.a2 = 2.0;
9      remoteSyncInvokePrimitive(1, _dispatcher2<int, double>::invoker,
10                               &arg, sizeof(arg), &ret);
11 }
12 struct req_message2 {
13     int (*func)(int, double);
14     int a1;
15     double a2;
16 };
17 class _dispatcher2 {
18 public:
19     static void invoker() {
20         req_message2<int, double>
21             *argp = (req_message2<int, double> *) _getArgPrimitive();
22         int val;
23         val = argp->func(argp->a1, argp->a2);
24         _remoteReturnPrimitive(&val, sizeof(val));
25     }
26 };

Figure 3. An instantiation of req_message2 and _dispatcher2


1  extern int foo(int, double);
2  mpc_main()
3  {
4      int ret;
5      invoke(ret, 1, foo, 5, 2.0);
6  }

The invoke template function call in line 5 of the above example is expanded to lines 5 to 10 in Figure 3. This expansion involves the instantiation of the req_message2<int, double> class and the _dispatcher2 class shown in the figure.
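The interplay of the request message, the per-type dispatcher, and the invoke template can be tried out in a single process. The following is a minimal sketch under stated assumptions: the transport is replaced by a local function call (fake_remote_invoke), the dispatcher receives the request and reply buffers as parameters instead of calling runtime primitives, and all names (ReqMessage2, Dispatcher2, invokeSketch, scaled) are hypothetical stand-ins rather than the MPC++ runtime interface.

```cpp
#include <cassert>
#include <cstring>

// Request message for a two-argument function (modelled on req_message2).
// The sketch assumes F, A1 and A2 are trivially copyable.
template<class F, class A1, class A2>
struct ReqMessage2 {
    F (*func)(A1, A2);
    A1 a1;
    A2 a2;
};

// Type-erased dispatcher entry point: unpack a request, write the reply.
using Dispatcher = void (*)(const void* request, void* reply);

// One dispatcher is instantiated per (F, A1, A2) combination,
// so it knows every data type without assembler code.
template<class F, class A1, class A2>
struct Dispatcher2 {
    static void invoker(const void* request, void* reply) {
        ReqMessage2<F, A1, A2> msg;
        std::memcpy(&msg, request, sizeof msg);   // stands in for receiving
        F val = msg.func(msg.a1, msg.a2);
        std::memcpy(reply, &val, sizeof val);     // stands in for the reply
    }
};

// Assumption: a local call replaces the network round trip.
inline void fake_remote_invoke(int /*pe*/, Dispatcher d,
                               const void* request, void* reply) {
    d(request, reply);
}

// Instantiating this template also instantiates the matching
// ReqMessage2 and Dispatcher2, as described for Figures 1 and 2.
template<class F, class A1, class A2>
void invokeSketch(F& ret, int pe, F (*f)(A1, A2), A1 a1, A2 a2) {
    ReqMessage2<F, A1, A2> msg{f, a1, a2};
    fake_remote_invoke(pe, &Dispatcher2<F, A1, A2>::invoker, &msg, &ret);
}

// Example user function.
inline int scaled(int n, double x) { return static_cast<int>(n * x); }
```

A call such as invokeSketch(r, 1, scaled, 5, 2.0) deduces F = int, A1 = int and A2 = double from the function pointer, which is the property the template approach relies on.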

4.1.2.  Heterogeneous  Computers  Environment 

To adapt the runtime system to the heterogeneous environment, an initialization routine is introduced, and then the templates described before are extended. Figure 4 is used as an example of a heterogeneous environment. It consists of two Sun Sparc machines running SunOS, designated as PEs 0 and 1, one Intel Pentium machine running Linux, referred to as PE 2, and one Compaq Alpha machine running Linux, referred to as PE 3.

• Initialization and Primitives

An executable file contains not only code but also a symbol name table in which function and global data symbols and their addresses are stored. When a process of an MPC++ program is spawned on each computer, the process reads its executable file in order to read all symbol information and sets up a name table for the function and global scope symbols and their addresses. Then, the processor type is passed to the other remote processes so that each process has the processor type table shown in Figure 4.

Let us define three functions: _isHomoEnv, _marshalName, and _unmarshalName. The _isHomoEnv function takes a processor number as its argument and returns boolean true if the computer specified by the argument has the same processor environment as the local processor. The _marshalName function takes two arguments, a buffer address and a function address or a global scope data address. It stores the symbol name of the address into the buffer. The _unmarshalName function takes the address of a buffer that holds a symbol name. It returns the address registered for that symbol name in the name table.
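The role of the name table can be sketched as follows. This is an illustration only: the class NameTable, the free functions marshalName and unmarshalName, and their signatures are assumptions chosen for the sketch (the runtime's actual signatures are not given in the text), the table is filled by hand instead of being read from the executable, and a linear search replaces the hash lookup.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Generic function pointer type used to store addresses of any function.
using AnyFn = void (*)();

// Hypothetical name table: symbol name <-> local function address.
class NameTable {
public:
    template<class Fn>
    void add(const std::string& symbol, Fn fn) {
        entries_.emplace_back(symbol, reinterpret_cast<AnyFn>(fn));
    }
    // Address -> symbol name; nullptr if the address is not registered.
    const std::string* nameOf(AnyFn fn) const {
        for (const auto& e : entries_)
            if (e.second == fn) return &e.first;
        return nullptr;
    }
    // Symbol name -> address; nullptr if the symbol is not registered.
    AnyFn addressOf(const std::string& symbol) const {
        for (const auto& e : entries_)
            if (e.first == symbol) return e.second;
        return nullptr;
    }
private:
    std::vector<std::pair<std::string, AnyFn>> entries_;
};

// Sender side: store the NUL-terminated symbol name of fn into buf and
// return the position just past it.
template<class Fn>
char* marshalName(const NameTable& table, char* buf, Fn fn) {
    const std::string* name = table.nameOf(reinterpret_cast<AnyFn>(fn));
    assert(name != nullptr);
    std::memcpy(buf, name->c_str(), name->size() + 1);
    return buf + name->size() + 1;
}

// Receiver side: read a symbol name from buf, resolve it to the local
// address, and return the position just past the name.
inline const char* unmarshalName(const NameTable& table, const char* buf,
                                 AnyFn& out) {
    const std::string name(buf);
    out = table.addressOf(name);
    return buf + name.size() + 1;
}

// Example user function; the symbol string used for it is just a label.
inline int foo(int n, double x) { return n + static_cast<int>(x); }
```

In a real heterogeneous run the sender and the receiver would each hold their own table, built from their own executable, so the same symbol name maps to a different address on each side.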

• invoke template function

The invoke template function shown in Figure 2 is modified for the heterogeneous environment as shown in Figure 5. If the receiver is of the same processor type, the homogeneous invocation mechanism is used in line 4. Otherwise, the heterogeneous invocation mechanism is performed as shown in lines 6 to 15.

Unlike the homogeneous case, a dispatcher function for the heterogeneous environment is also converted to its symbol name in line 9. The dispatcher hinvoker function of
[Figure 4: four PEs, each holding the same host info table (PE#0 SunOS Sparc, PE#1 SunOS Sparc, PE#2 Linux i386, PE#3 Linux Alpha) and a name table read from its own a.out. The symbols _foo_Fid, _main, and _hinvoker2_... appear at machine-specific addresses, e.g., 00002290, 000022c4, and 00004010 in one table; 08048568, 08048580, and 08048800 in another; and 0120006f8, 012000738, and 012002000 in a third. A call invoke(3, foo, 5, 2.0) sends a message to PE#3 that carries the dispatcher symbol name, the symbol name "_foo_Fid", and the marshalled arguments.]

Figure 4. An Execution Environment Example


1  template<class F, class A1, class A2>
2  inline void invoke(F &ret, int pe, F (*f)(A1, A2), A1 a1, A2 a2) {
3      if (_isHomoEnv(pe)) {
4          // Same as in Figure 2
5      } else {
6          char buf[MAX_REQMSG];
7          char *argp;
8
9          argp = _marshalName(buf, _dispatcher2<F, A1, A2>::hinvoker);
10         argp = _marshalName(argp, f);
11         argp = _marshal(argp, a1);
12         argp = _marshal(argp, a2);
13         remoteSyncInvokePrimitive(pe, 0,
14                                   buf, argp - buf, &buf);
15         _unmarshal(buf, ret);
16     }
17 }

Figure 5. invoke template function in the heterogeneous environment
the class template _dispatcher2 will be shown later. After marshalling the symbol name of the user function, arguments a1 and a2 are marshalled using the _marshal function.

The _marshal function is an overloaded function. The MPC++ runtime system defines it for all C++ basic data types. An example of the prototype specifications is shown below. If an argument is of a user-defined data type or structure, its marshal operation must be defined by the user.

    caddr_t _marshal(caddr_t, int);
    caddr_t _marshal(caddr_t, double);

As shown in line 13 of Figure 5, the second argument of remoteSyncInvokePrimitive is zero so that the function knows this call is for the heterogeneous environment.
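How an overload set of marshal operations extends to a user-defined structure can be sketched as follows. The functions below are hypothetical and use char* rather than caddr_t; the wire format (big-endian integers, and doubles sent as the big-endian bit pattern of a 64-bit IEEE 754 value) is an assumption of this sketch, not the format used by the MPC++ runtime.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(int) == 4, "sketch assumes a 32-bit int");
static_assert(sizeof(double) == 8, "sketch assumes a 64-bit double");

// int: four bytes, most significant byte first.
inline char* marshal(char* p, int v) {
    const std::uint32_t u = static_cast<std::uint32_t>(v);
    for (int i = 0; i < 4; ++i)
        p[i] = static_cast<char>((u >> (24 - 8 * i)) & 0xffu);
    return p + 4;
}

inline const char* unmarshal(const char* p, int& v) {
    std::uint32_t u = 0;
    for (int i = 0; i < 4; ++i)
        u = (u << 8) | static_cast<unsigned char>(p[i]);
    v = static_cast<int>(u);
    return p + 4;
}

// double: the 64-bit pattern, most significant byte first.
inline char* marshal(char* p, double v) {
    std::uint64_t u;
    std::memcpy(&u, &v, sizeof u);
    for (int i = 0; i < 8; ++i)
        p[i] = static_cast<char>((u >> (56 - 8 * i)) & 0xffu);
    return p + 8;
}

inline const char* unmarshal(const char* p, double& v) {
    std::uint64_t u = 0;
    for (int i = 0; i < 8; ++i)
        u = (u << 8) | static_cast<unsigned char>(p[i]);
    std::memcpy(&v, &u, sizeof v);
    return p + 8;
}

// A user-defined type supplies its own overloads, built member by member
// from the overloads for the basic types.
struct Sample {
    int id;
    double weight;
};

inline char* marshal(char* p, const Sample& s) {
    p = marshal(p, s.id);
    return marshal(p, s.weight);
}

inline const char* unmarshal(const char* p, Sample& s) {
    p = unmarshal(p, s.id);
    return unmarshal(p, s.weight);
}
```

Since overload resolution picks the operation from the static argument type, a template such as invoke can marshal any argument type for which an overload exists without further changes.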

The definition of hinvoker is shown in Figure 6. The hinvoker function i) unmarshals all arguments, ii) invokes a function, iii) marshals the return value, and iv) sends a reply message to the sender. The _unmarshal function is an overloaded function, the same as the _marshal function.

In the example shown in Figure 4, a remote function invocation from PE#0 to PE#1 uses the same mechanism as in the homogeneous environment since those two machines have the same processor and operating system. In the case of a remote function invocation from PE#0 to PE#2, since the sender knows that the remote processor's integer data type representation is different, the integer data is converted to the network byte order, in addition to converting the addresses of the dispatcher function and the user function foo to symbols. On the other hand, in the case of a remote function invocation from PE#0 to PE#3, the integer data is not converted since the integer data byte order of both is the same.
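The decision between the homogeneous and heterogeneous paths can be sketched with a host table like the one in Figure 4. The structure HostInfo and the function isHomoEnv below are hypothetical illustrations; in particular, comparing the operating system and processor names is an assumption about what "the same processor environment" means.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One row of the host table exchanged at initialization.
struct HostInfo {
    std::string os;
    std::string cpu;
};

// The four PEs of the example environment in Figure 4.
inline std::vector<HostInfo> exampleHostTable() {
    return {
        {"SunOS", "Sparc"},   // PE#0
        {"SunOS", "Sparc"},   // PE#1
        {"Linux", "i386"},    // PE#2
        {"Linux", "Alpha"},   // PE#3
    };
}

// Hypothetical check: two PEs form a homogeneous pair when both the
// operating system and the processor type match.
inline bool isHomoEnv(const std::vector<HostInfo>& table,
                      std::size_t self, std::size_t pe) {
    return table[self].os == table[pe].os &&
           table[self].cpu == table[pe].cpu;
}
```

With this table, an invocation from PE#0 to PE#1 takes the homogeneous path, while invocations from PE#0 to PE#2 or PE#3 take the heterogeneous path with symbol names and marshalled arguments.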

4.2.  Global  Pointer 

In this subsection, a GlobalPtr in the homogeneous environment is implemented first, and then it is extended to adapt to the heterogeneous environment.

4.2.1.  Homogeneous  Computers  Environment 

Figure 7 shows an implementation of the GlobalPtr class, which is a simplified version of the actual MPC++ implementation. This implementation is described using the following usage of a GlobalPtr object:

1  int gi;
2  void foo()
3  {
4      GlobalPtr<int> gp;
5      gp = &gi;
6      i = *gp;
7      *gp = 10;
8  }


Table 1. Evaluation Environment

Nodes            Two Sun Sparc Station 20s (75 MHz, 64 MB, SunOS 4.1.3)
                 One Intel Pentium Pro (200 MHz, 128 MB, NetBSD 1.2.1)
                 One Compaq PWS 600au (Alpha 21164, 500 MHz, 128 MB, Linux 2.0.32)
Network          Myricom Myrinet
Runtime System   SCore standalone system with PM communication library

In line 5, the overloaded assignment operator defined in line 19 of Figure 7 is invoked so that the address of gi and the processor number are stored in the internal data structure Gstruct of the GlobalPtr object. The assignment expression in line 6 has the same effect as the following expression:

i = (int)*gp;

This means that the pointer dereference operation is invoked first, followed by the cast operation to the integer type. As defined in line 20 of Figure 7, the pointer dereference operation returns the Gstruct object. The cast operation from the Gstruct object to the integer type is defined in line 8 of Figure 7.
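The proxy technique described above can be exercised in a single process. The sketch below is an illustration under assumptions: the remote memory primitives are replaced by local memcpy stand-ins (fakeRemoteMemRead and fakeRemoteMemWrite), the local node number is fixed at 0, and the class names GlobalPtrSketch and Ref are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Assumption: local copies stand in for the remote memory primitives.
inline void fakeRemoteMemRead(int /*pe*/, const void* addr, void* dst,
                              std::size_t n) {
    std::memcpy(dst, addr, n);
}
inline void fakeRemoteMemWrite(int /*pe*/, void* addr, const void* src,
                               std::size_t n) {
    std::memcpy(addr, src, n);
}

template<class T>
class GlobalPtrSketch {
public:
    // Proxy returned by operator*: converting it to T performs a read,
    // assigning a T to it performs a write.
    class Ref {
    public:
        Ref(int pe, T* addr) : pe_(pe), addr_(addr) {}
        operator T() const {
            T buf;
            fakeRemoteMemRead(pe_, addr_, &buf, sizeof(T));
            return buf;
        }
        Ref& operator=(const T& t) {
            fakeRemoteMemWrite(pe_, addr_, &t, sizeof(T));
            return *this;
        }
    private:
        int pe_;
        T* addr_;
    };

    // Assigning a local address records the address and the local node.
    GlobalPtrSketch& operator=(T* data) {
        pe_ = 0;            // assumption: the local node is node 0
        addr_ = data;
        return *this;
    }
    // Explicitly set the address together with a node number.
    void set(T* addr, int pe) {
        pe_ = pe;
        addr_ = addr;
    }
    Ref operator*() const { return Ref(pe_, addr_); }

private:
    int pe_ = 0;
    T* addr_ = nullptr;
};
```

With these definitions, *gp = 10 selects Ref::operator= and int i = *gp selects the conversion operator, which mirrors the two cases discussed for lines 6 and 7 of the usage example.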

4.2.2.  Heterogeneous  Computers  Environment 

In order to adapt the GlobalPtr facility to the heterogeneous environment, three extensions are considered:

1. The _marshal and _unmarshal functions for the GlobalPtr class are defined so that a GlobalPtr object may be passed to a remote computer.

2. The assignment of a pointer address for remote data in the set method is extended so that the remote address is obtained. This requires extra communication.

3. The remote read/write operations are extended so that data conversion is performed for a different processor architecture.

5.  Preliminary  Evaluation  Results 
5.1.  Environment 

The new MPC++ runtime system has run on a heterogeneous testbed where two Sun SS20s, one Compaq PWS600au, and one PC are connected by the Myrinet network, as shown in Table 1. The GNU egcs compiler is used to compile the benchmark programs.
template<class F, class A1, class A2>
class _dispatcher2 {
public:
    static void invoker() { ... }
    static void hinvoker() {
        caddr_t argp = _getArgPrimitive();
        F       (*f)(A1, A2);
        A1      a1;
        A2      a2;
        F       val;
        char    buf[sizeof(F)];

        argp = _unmarshalName(argp);
        argp = _unmarshal(argp, a1);
        argp = _unmarshal(argp, a2);
        val = f(a1, a2);
        argp = _marshal(buf, val);
        _remoteReturnPrimitive(buf, sizeof(val));
    }
};

Figure 6. A dispatcher in the heterogeneous environment


1  template<class T>
2  class GlobalPtr {
3  private:
4      class Gstruct {
5      public:
6          int pe;
7          caddr_t addr;
8          operator T() {
9              T buf;
10             _remoteMemRead(pe, addr, &buf, sizeof(T));
11             return buf;
12         }
13         void operator=(T t) {
14             _remoteMemWrite(pe, addr, &t, sizeof(T));
15         }
16     };
17     Gstruct gval;
18 public:
19     GlobalPtr<T> &operator=(T *data) { gval.pe = myNode; gval.addr = (caddr_t)data; return *this; }
20     GlobalPtr<T>::Gstruct &operator*() { return gval; }
21     void nwrite(T *data, int nitem);
22     void set(void *addr, int pn) { }
23 };

Figure 7. An Implementation of GlobalPtr
Table 2. Cost of a remote void function invocation with zero arguments (round trip time)

Sender to Receiver    RTT (μseconds)
Sun to Sun            36.0
Sun to Pentium        38.8
Sun to Alpha          36.6
Pentium to Sun        46.5
Pentium to Alpha      30.5
Alpha to Sun          44.7
Alpha to Pentium      30.5

Figure 8. Remote Function Invocation

5.2. Remote Function Invocation Cost

The cost (round trip time) of a remote invocation of a void function with no arguments is shown in Table 2. Figure 8 shows the cost of remote void function invocations whose number of arguments varies from 0 to 8.

According to these results, the costs of Pentium to Sun and Alpha to Sun are greater than those of Sun to Pentium and Sun to Alpha, i.e., an extra cost of about 8 microseconds is added. We have therefore analyzed the cost in more detail.

The cost of marshalling/unmarshalling integers is less than two microseconds, as shown in Figure 9. Figure 10 shows the _marshalName and _unmarshalName costs when the _dispatcher2<F, A1, A2>::hinvoker function symbol is passed. The costs of these functions depend on the symbol name length because a hash key for the function name table is generated and strings are compared. As shown in this figure, the _unmarshalName cost on the Sun is greater than on the other machines. This leads to the extra cost.


Figure 9. Integer Data Marshalling/Unmarshalling Cost

Figure 10. Function Address Marshalling/Unmarshalling Cost

Figure 11. Remote double Data Write

Figure 12. Revised Remote double Data Write

5.3. Remote Memory Write Cost

Figure 11 shows the bandwidth of a remote memory write to an area of double data using the nwrite method of the GlobalPtr<double> object. The bandwidth between different processor types is low due to the data conversion cost.

As shown below, the C++ template feature allows us to write a special nwrite method for the double data type so that no conversion is required between the Pentium and Alpha machines. Figure 12 shows the result of this improvement.

    void GlobalPtr<double>::nwrite(double *data, int nitem)
    {
        /* special method for double */
    }

6. Related Work

The template technique for the function invocation and global pointer described in this paper has also been realized by ABC++[8] and HPC++ Lib[2]. As far as we know, the ABC++ runtime only supports the homogeneous environment. HPC++[2] provides a registration mechanism for a remote function invocation in the heterogeneous environment. This means that a program running in the homogeneous environment differs from a program running in the heterogeneous environment.

Though HPC++ does not assume the SPMD execution model in the heterogeneous environment, the technique described in this paper can be applied as follows: Instead of the registration, a function defined in another program is defined with a dummy body so that the function address and its symbol are locally defined. Of course, the function address is meaningless except that the address is used to search for the symbol name locally. The symbol name is passed to the receiver, where the symbol is mapped to a function address in the receiver.

CC++[7] is a language extension to C++ in which global pointers, synchronization structures, and other parallel constructs are introduced. The CC++ runtime system has run in the heterogeneous environment using the Globus runtime system[1]. MPC++ realizes the same functionality of the global pointer, synchronization structure, and remote function invocation without any language modification, and it does not require a special compiler in either the homogeneous or the heterogeneous environment.

7.  Concluding  Remarks 

In this paper, after presenting an overview of the MPC++ parallel programming language, the new MPC++ runtime system for heterogeneous computing has been designed, implemented, and evaluated. If an MPC++ program contains remote function invocations whose arguments are only basic data types, the program runs in both the homogeneous and heterogeneous environments without any changes. If a remote function invocation contains a data structure, the user needs to define its marshalling/unmarshalling functions in the current implementation. In order to avoid such user-level definitions, we are now developing a generator for those functions using the MPC++ metalevel architecture[6].

The preliminary evaluation results show that some improvement of the _unmarshalName function is needed. There are two methods. One is that the hash key is pre-calculated and stored in the name table so that the sender passes it to the receiver. The _unmarshalName function on the receiver then does not involve the hash generation. The other method is that all address and symbol information of the remote processors is kept on all processors so that the remote address of a function is resolved on the sender. This does not require any extra cost on the receiver side. However, the table size is N times larger, where N is the number of different processor types.
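The first method, passing a pre-calculated hash key, can be sketched as follows. This is an illustration under assumptions: the FNV-1a hash function, the bucket layout, and the names HashedNameTable and resolve are choices made for the sketch and are not taken from the MPC++ runtime. The scheme only works if the sender and the receiver compute keys with the same hash function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// 32-bit FNV-1a hash, used here as an example of a hash that every
// machine type computes identically.
inline std::uint32_t fnv1a(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

using AnyFn = void (*)();

// Each entry keeps its key so that it is computed once, at set-up time.
struct Symbol {
    std::string name;
    std::uint32_t key;
    AnyFn addr;
};

class HashedNameTable {
public:
    explicit HashedNameTable(std::size_t buckets = 64) : buckets_(buckets) {}

    // Set-up time: hash the name once and store the key with the entry.
    void add(const std::string& name, AnyFn addr) {
        const std::uint32_t key = fnv1a(name);
        buckets_[key % buckets_.size()].push_back({name, key, addr});
    }

    // Receiver side: the key arrives with the message, so the lookup
    // selects the bucket directly and only compares strings to confirm.
    AnyFn resolve(std::uint32_t key, const std::string& name) const {
        for (const Symbol& s : buckets_[key % buckets_.size()])
            if (s.key == key && s.name == name) return s.addr;
        return nullptr;
    }

private:
    std::vector<std::vector<Symbol>> buckets_;
};

// Example function whose address is registered in the table.
inline void hello() {}
```

The string comparison is kept as a safeguard against two symbols that share a key; only the hash generation is removed from the receiver's critical path.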

In terms of performance heterogeneity, the conversion of large data should be performed on the faster processor instead of always converting it on the sender. To realize this, the host table must contain performance information so that the placement of the data conversion operation can be decided at runtime.

The major MPC++ application, so far, is a global operating system called SCore-D[4, 3], which enables a multi-user environment on clusters of computers. We are currently porting SCore-D to the heterogeneous environment.

The authors believe that this paper contributes a template technique to solve issues of a runtime system in the heterogeneous environment, i.e., data conversion, marshalling/unmarshalling of arguments, and function address resolution. Though MPC++ assumes the SPMD programming style, the technique is general enough to be applicable to other parallel and distributed C++ language systems.
Abstract

The goal of this paper is to report our findings as to which CORBA services are ready to support distributed system software in a heterogeneous environment. In particular, we implemented intercommunication between components in our Management System for Heterogeneous Networks (MSHN¹) using four different CORBA mechanisms: the Static Invocation Interface (SII), the Dynamic Invocation Interface (DII), Untyped Event Services, and Typed Event Services. MSHN's goals are to manage dynamically changing sets of heterogeneous adaptive applications in a heterogeneous environment. We found these mechanisms at various stages of maturity, resulting in some being less useful than others. In addition, we found that the overhead added by CORBA varied from a low of 10.6 milliseconds per service request to a high of 279.1 milliseconds per service request on workstations connected via 100 Mbits/sec Ethernet. We therefore conclude that using CORBA not only substantially decreases the amount of time required to implement distributed system software, but it need not degrade performance.

1 Introduction

This paper describes the experiences we had using CORBA mechanisms to implement intercommunication in MSHN. MSHN's goal is to support the execution of multiple, disparate, adaptive applications² in a dynamic, distributed heterogeneous environment. To accomplish this goal, MSHN consists of multiple, distinct, and eventually replicated distributed components that themselves execute in a heterogeneous environment.

*This research was supported by DARPA under contract number E583. Additional support was provided by the Naval Postgraduate School and the Institute for Joint Warfare Analysis.

¹ Pronounced “mission”.

² This paper focuses on the use of CORBA mechanisms to support the components of MSHN, not the applications that MSHN itself supports. For more details concerning applications, please see the references or contact the authors directly.


These components have widely varying functionality, come in and out of existence, and communicate across heterogeneous networks. In addition to executing on different types of platforms, these components are also likely to be written in different programming languages. We can, of course, at the expense of a great deal of programmer's time, implement specialized naming services to locate the appropriate component at run-time, and specialized communication mechanisms to enable communication between the heterogeneous platforms upon which the components run. Alternatively, we can use a general tool, such as the Common Object Request Broker Architecture (CORBA), to achieve the same functionality while reducing our development time. Experience with generalized systems, such as CORBA, has revealed that the reduction in development time comes at the expense of run-time performance, which can be critical in real-time applications. This research, therefore, investigates the utility and overhead of communication mechanisms, which are implemented according to the CORBA 2.2 specification, to support MSHN's inter-component communication.

We note to the reader that our interest lies in the CORBA mechanisms that support the development of (possibly real-time) resource management environments. This is a very specific realm where system overheads can have a significant impact on performance. We do not explore the many and varied capabilities of CORBA for supporting other environments, such as distributed general database services and video streaming. Our interest in CORBA is primarily as a tool to reduce the time/programming investment needed to implement our resource management system middleware. As the services and mechanisms provided by the CORBA 2.2 specification, particularly Static and Dynamic Invocation, and the Event Services, hold great promise in this regard, we performed the series of studies detailed in this paper.

CORBA specifies a standard to permit different programs, executing on different computers, to request services from one another. CORBA's Naming Service and Object Request Brokers (ORBs) aid clients in locating appropriate servers. CORBA's static invocation enables a CORBA client to make a request of a server that is identified prior to compile time. It provides both reliable synchronous semantics and unreliable asynchronous semantics. In contrast, CORBA's dynamic invocation enables the client to locate a server that may not be known until run-time, and provides reliable synchronous and asynchronous semantics, as well as unreliable asynchronous semantics. CORBA's event services allow processes on one machine to place event notifications intended for processes on other machines into event queues so that the notifications can later be delivered to the serving processes. This service facilitates multicast. This paper will not cover CORBA in detail, but there are many other good references on the subject [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11].

The paper is organized as follows. We first briefly describe MSHN, concentrating on the type of intercommunication that is required by its components. A more complete description of MSHN can be found elsewhere [12]. We then present alternate designs for facilitating communication within MSHN itself, together with the implementation of these designs. These designs are based upon, respectively, static invocation, dynamic invocation, untyped event service, and typed event service. In that section, we also provide a qualitative assessment detailing the problems that we encountered while attempting to use these mechanisms within MSHN. In a subsequent section, we describe our experiments for evaluating these mechanisms within MSHN and present a quantitative analysis of each of the mechanisms. Finally, we summarize our findings.

2 The Management System for Heterogeneous Networks (MSHN)

In the Heterogeneous Processing Laboratory at the Naval Postgraduate School, we are designing, implementing, and testing a resource management system called the Management System for Heterogeneous Networks (MSHN). MSHN is designed as a general experimental platform for investigating issues relating to the design and construction of future resource management systems operating in heterogeneous environments. Though MSHN is used to explore a large number of such issues, our present research focuses on finding and developing (1) mechanisms for supporting adaptive applications, (2) mechanisms for supporting the satisfaction of user and system defined Quality of Service (QoS) requirements, and (3) mechanisms for acquiring and usefully aggregating measurements of both general resource availability and the resource usage of individual tasks. A thorough and complete description of MSHN can be found in Hensgen [12].

Figure 1: MSHN Conceptual Architecture

MSHN's architecture consists of multiple instantiations of each of the components enumerated below:

• a Client Library (one for each executing application to be managed by MSHN),

• a Scheduling Advisor (hierarchically replicated),

• a Resource Requirement Database (hierarchically replicated),

• a Resource Status Server (hierarchically replicated), and

• a MSHN Daemon (when needed).

Figure  1,  the  MSHN  Conceptual  Architecture,  shows 
all  of  the  MSHN  components  (shaded)  as  translucent 
layers  executing  on  distributed  platforms.  A  translu¬ 
cent  layer  is  one  that  can  be  bypassed  by  layers  that 
are  above  or  below  it.  For  example,  the  MSHN  Dae¬ 
mon  (mshnd)  can  interact  directly  with  the  operating 
systems  layer,  bypassing  the  Resource  Status  Server, 
the  Resource  Requirement  Database  and  the  Schedul¬ 
ing  Advisor.  In  the  environment  that  MSHN  supports, 
both  MSHN  and  non-MSHN  applications  may  be  ex¬ 
ecuting  at  any  given  time.  Figure  2  illustrates  how  these 
components,  along  with  various  MSHN  and  non-MSHN 
applications,  might  actually  be  distributed  among  dif¬ 
ferent  heterogeneous  machines. 

Figure 2: Example MSHN Physical Instantiation


This  research  investigates  how  communication 
between  the  components  can  be  facilitated.  As  such, 
the  MSHN  description  in  the  remainder  of  this  section 
emphasizes  that  communication. 

Figure 3, MSHN’s Software Architecture, illustrates all of the
interactions between the components. MSHN has a peer-to-peer
architecture^.

We  now  present  two-  and  three-tier  views  to  give 
a  clear  understanding  of  the  interactions  between  the 
components.  Generally,  many  applications,  each  linked 
with  the  MSHN  Client  Library,  will  be  running  at  any 
given  time.  They  will  need  to  communicate  with  a 
Scheduling  Advisor  (SA)  to  request  the  appropriate  re¬ 
sources  needed  to  start  new  processes.  They  may  also 
communicate  with  a  MSHN  Daemon  when  receiving 
their  recommended  schedule.  Additionally,  their  Client 
Libraries  update  the  Resource  Requirement  Database 
(RRD)  and  the  Resource  Status  Server  (RSS)  with  the 
expected  resource  requirements  of  the  applications  and 
current  resource  availability  within  the  MSHN  system. 

^When callbacks are used, the client and the server have a
peer-to-peer relationship. In distributed systems, callbacks are
useful as a mechanism for performing asynchronous communication.
Callbacks transmit event notifications without blocking the event
originator. Callbacks flow from the servers towards the clients.


Figure  3:  MSHN’s  Software  Architecture 


Figure 4 illustrates this updating interaction as a two-tiered
client/server architecture. The arrows labeled “1” designate the
Resource Requirement Database update path, and those labeled “2,”
the Resource Status Server update path. The update frequency of the
Resource Status Server is expected to be high so that it, in turn,
can supply the Scheduling Advisor with accurate and current
information.

We  anticipate  that  the  frequency  of  the  updates  will 
load  down  the  network,  and  cause  a  considerable  pro¬ 
cessing  load  on  the  Resource  Status  Server  and  the  Re¬ 
source  Requirement  Database.  To  avoid  these  loads, 
MSHN's  design  includes  proxy  Resource  Status  Servers 
and  Resource  Requirement  Databases  that  will  come  in 
and  out  of  existence  as  required  to  minimize  the  number 
of  updates.  These  proxies  will  filter  gathered  informa¬ 
tion  and  update  the  hierarchical  Resource  Status  Server 
and  the  hierarchical  Resource  Requirement  Database 
when  necessary. 
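
The filtering role of such a proxy can be illustrated with a minimal
sketch. The class names, the `threshold` parameter, and the forwarding
rule (forward an update only when it differs from the last forwarded
value by more than the threshold) are assumptions made for
illustration; the filtering policy itself is not specified here.

```python
class StatusServer:
    """Stand-in for the hierarchical Resource Status Server (hypothetical)."""
    def __init__(self):
        self.updates = []  # every update that actually reached the server

    def update(self, resource, value):
        self.updates.append((resource, value))


class ProxyStatusServer:
    """Hypothetical proxy: forwards an update only when it differs enough
    from the last value forwarded for the same resource."""
    def __init__(self, upstream, threshold):
        self.upstream = upstream
        self.threshold = threshold      # assumed filtering parameter
        self.last_forwarded = {}

    def update(self, resource, value):
        last = self.last_forwarded.get(resource)
        if last is None or abs(value - last) > self.threshold:
            self.last_forwarded[resource] = value
            self.upstream.update(resource, value)


server = StatusServer()
proxy = ProxyStatusServer(server, threshold=0.10)
for load in [0.50, 0.52, 0.55, 0.70, 0.71, 0.40]:
    proxy.update("cpu@hostA", load)
```

In this run, six client updates result in three updates at the
server, which is the kind of reduction the proxies are meant to
provide.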

In  one  view,  the  Scheduling  Advisor  functionally 
resides  between  the  information  needed  to  create  a 
schedule  (the  Resource  Status  Server  and  the  Resource 
Requirement  Database)  and  the  requesters  of  schedules 
(applications  linked  with  the  Client  Library).  This  in¬ 
dicates  that  there  will  be  a  high  communication  rate 
to  and  from  the  Scheduling  Advisor.  We  can  there¬ 
fore  also  view  MSHN  as  having  three  tiers,  where  the 
Scheduling  Advisor  is  the  second  tier,  and  the  Resource 
Status  Server  and  the  Resource  Requirement  Database 
are  in  the  third  tier  (see  Figure  5).  When  the  Client 
Library  (first  tier)  contacts  the  Scheduling  Advisor  for 
a schedule, either directly or via the MSHN Daemon (the arrows
labeled “1” and “1a”), the Scheduling Advisor queries both the
Resource Status Server (arrows “2” and “3”) and the Resource
Requirement Database (arrows “4” and “5”) before it computes its
schedule and sends it to the MSHN Daemon or the Client Library,
whichever is more appropriate (arrows “6” and “6a”).

Figure 4: Two-tiered Architectural View of MSHN Architecture
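
The three-tier request chain can be sketched as follows. This is
only an illustration: the class and method names are hypothetical,
the calls are ordinary local calls rather than CORBA invocations,
and the placement policy (choose the host with the most free
capacity that satisfies the requirement) is a placeholder assumption,
not MSHN's scheduling algorithm.

```python
class ResourceStatusServer:
    def __init__(self, availability):
        self.availability = availability  # host -> free CPU fraction

    def query(self):
        return dict(self.availability)


class ResourceRequirementDatabase:
    def __init__(self, requirements):
        self.requirements = requirements  # application -> CPU fraction needed

    def query(self, application):
        return self.requirements[application]


class SchedulingAdvisor:
    """Second tier: sits between the requesters and the two data sources."""
    def __init__(self, rss, rrd):
        self.rss = rss
        self.rrd = rrd

    def request_schedule(self, application):
        status = self.rss.query()               # arrows 2 and 3
        needed = self.rrd.query(application)    # arrows 4 and 5
        # Placeholder policy (assumption): pick the host with the most
        # free capacity that still satisfies the requirement.
        candidates = [h for h, free in status.items() if free >= needed]
        if not candidates:
            return None
        return max(candidates, key=lambda h: status[h])  # arrow 6


rss = ResourceStatusServer({"hostA": 0.2, "hostB": 0.7, "hostC": 0.5})
rrd = ResourceRequirementDatabase({"app1": 0.4, "app2": 0.9})
advisor = SchedulingAdvisor(rss, rrd)
schedule = advisor.request_schedule("app1")   # arrow 1: the client's request
```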

Although  the  Client  Libraries  are  the  initiators  of 
many  of  the  communication  chains  through  the  MSHN 
system,  other  chains  are  initiated  by  the  Resource 
Status  Server.  For  example,  in  the  case  where  a  viola¬ 
tion  of  a  deadline  occurs  because  of  a  change  in  resource 
availability,  the  Resource  Status  Server  will  trigger  the 
Scheduling  Advisor  to  reschedule  processes  that  would 
not otherwise meet their deadlines. The Scheduling Advisor will
adapt to the new situation by either changing the format^ of the
process or restarting it on a different resource, possibly via the
MSHN Daemon. This interaction is the reverse of the previously
described communication chain and can be used to define another
version of a three-tiered view (see Figure 6).

Figure 6: Alternate Three-tiered View of MSHN

Although we have shown several two- and three-tier views of MSHN,
the reader should understand that these are only examples. Much
longer chains will actually exist when the various components are
hierarchically replicated.

3  Use  of  CORBA  Services  in  MSHN 
and  Problems  Encountered 

Our goal is to determine both (1) how we can best facilitate
efficient communication between the components in our architecture
using mechanisms from the CORBA 2.2 specification, and (2) the
run-time overhead of each of those mechanisms. Our criteria for
choosing a particular mechanism included extensibility, scalability,
portability, flexibility, and efficiency.

MSHN  consists  of  multiple,  eventually  replicated, 
distinct  distributed  components  that  execute  in  a  het¬ 
erogeneous  environment.  These  components  will  have 
widely  varying  functionality,  will  come  in  and  out  of  ex¬ 
istence,  will  communicate  via  heterogeneous  networks, 
and  will  execute  on  different  platforms.  To  facilitate 
the interactions between MSHN’s components, we identified four
mechanisms from the CORBA 2.2 specification that had particular
promise: the Typed Event Service, the Untyped Event Service, the
Static Invocation Interface (SII), and the Dynamic Invocation
Interface (DII). After settling on these four mechanisms, we
implemented a prototype of MSHN’s communication infrastructure using
each of them. First, we describe how the MSHN architecture would
benefit from each of these four mechanisms. Then we discuss how we
use the Naming Service within MSHN to obtain object references.
Because part of the objective of this paper is to recommend
additions and improvements to the evolving CORBA specification, in
this section we describe and justify each of our designs, the
problems we encountered, and the solutions at which we arrived.

^We use the term “format” to refer to a mechanism we have developed
to support adaptive applications [13].

3.1  Selection  of  a  CORBA  ORB 

At the beginning of this research, we explored various
implementations of the CORBA standard. Figures 7 and 8 present a
summary of the results of that exploration^. Because of several
constraints, including the cost of some of the implementations, the
time required to implement comparative tests, and the duration of
this study, we had to limit ourselves to one CORBA implementation.
We chose the implementation that seemed, at that time, to have the
most mature features relevant to MSHN. Our assumption was that once
such an implementation was found, other implementations would
typically have similar difficulties and comparable performance. As
such, we based our studies around IONA’s Orbix, the implementation
that best fit this requirement.

3.2  Event  Service 

Event  Service  allows  multiple  suppliers  and  multiple 
consumers  to  deliver  and  receive  notifications  for  a  set 
of  events.  An  Event  Channel  transparently  permits 
(1)  suppliers  to  send  notifications  of  events  and  (2) 
consumers  to  receive  these  notifications,  all  without 
knowledge  of  the  existence  of  one  another.  Hence,  the 
Event  Service  will  support  the  transparent  replication 
of  MSHN  system  components  for  reliability  and  de¬ 
pendability.  Event  Service  will  enable  Client  Libraries, 
linked with different concurrent applications, to communicate with
other MSHN components seamlessly. Finally, the Event Service
supports a standard Application Programming Interface (API) (e.g.,
for the Push-Push Model, a single operation push() taking a variable
of type any as a parameter), which eases the development of MSHN
system components.

^The capabilities of the various implementations of CORBA evolve
very quickly. The content of these figures presents the state of
some of the implementations at the time this research was performed.
As the capabilities of most CORBA implementations can change
quickly, readers are advised to conduct their own similar
exploration.

Though  there  are  four  models  for  Event  Service, 
there  were  only  two  available  in  relatively  robust  in¬ 
dustrial  implementations  when  we  performed  our  ex¬ 
periments:  the  Push-Push  Model  and  the  Pull-Pull 
Model  [14].  Using  the  Pull-Pull  Model  creates  an  ad¬ 
ditional  load  on  the  consumers.  Because  our  servers, 
the  consumers  in  this  case,  must  minimize  their  use  of 
computing  resources  even  when  there  is  no  event  to  be 
delivered  on  the  Event  Channel,  we  chose  to  use  only 
the  Push-Push  Model. 
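
The difference in consumer load between the two models can be
illustrated with a small sketch. It is a simplification under stated
assumptions: the class names are hypothetical, the channels are local
objects rather than CORBA Event Channels, and the pulling consumer is
modeled as polling once per time step.

```python
class PushChannel:
    """Push-Push style: the channel calls each consumer's push()."""
    def __init__(self):
        self.consumers = []

    def connect(self, consumer):
        self.consumers.append(consumer)

    def push(self, event):
        for consumer in self.consumers:
            consumer.push(event)


class PullChannel:
    """Pull-Pull style (simplified): consumers must ask for events."""
    def __init__(self):
        self.pending = []

    def supply(self, event):
        self.pending.append(event)

    def try_pull(self):
        return self.pending.pop(0) if self.pending else None


class Consumer:
    def __init__(self):
        self.calls = 0      # how often the consumer had to do any work
        self.received = []

    def push(self, event):
        self.calls += 1
        self.received.append(event)

    def poll(self, channel):
        self.calls += 1
        event = channel.try_pull()
        if event is not None:
            self.received.append(event)


# One event arrives during ten time steps.
pushed, polled = Consumer(), Consumer()
push_channel, pull_channel = PushChannel(), PullChannel()
push_channel.connect(pushed)
for step in range(10):
    if step == 4:
        push_channel.push("schedule-request")
        pull_channel.supply("schedule-request")
    polled.poll(pull_channel)   # the pulling consumer works on every step
```

Both consumers receive the same event, but the pushed consumer is
activated once while the polling consumer is activated on every
step, including the nine steps with nothing to deliver.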

3.2.1  Using  Event  Service  in  MSHN 

Figure  9  illustrates  the  use  of  Event  Service  to  organize 
communication  in  the  MSHN  architecture.  In  this  ap¬ 
proach,  the  components  of  MSHN  must  register  them¬ 
selves  as  both  a  consumer  and  a  supplier  to  the  Event 
Channel.  The  Event  Channel  acts  as  the  glue  between 
all  of  the  components  and  delivers  notifications  to  each 
of  them. 

3.2.2  Problems  with  Initial  Approach 

Although  this  approach  helps  to  organize  MSHN’s 
communication,  providing  transparent  reliability  and 
scalability,  some  problems  can  be  seen  involving  both 
performance  and  the  CORBA  2.2  specification.  Some 
of  the  problems  with  this  approach  are  identical  to  the 
problems  identified  by  Schmidt  and  Vinoski  in  the  ana¬ 
lysis  of  their  stock  market  application  [11].  We  first 
summarize  their  findings  in  the  first  two  items  below, 
Loss  of  Events  in  the  System,  and  Problems  with  the 
Untyped  Event  Service.  Then  we  enumerate  additional 
problems  that  are  particular  to  using  CORBA  within 
the  MSHN  architecture.  Lastly,  we  look  at  how  to  im¬ 
plement  a  component  that  is  both  a  supplier  and  a  con¬ 
sumer. 

Loss  of  Events  in  the  System.  Event  Service 
guarantees  delivery  of  notifications  to  all  registered  con¬ 
sumers  as  long  as  the  Event  Service  process  does  not 
fail^. However, in the Event Service specification, persistency of
events in the Event Channel is not required.
Therefore,  if  an  Event  Service  process  does  fail,  un¬ 
delivered  notifications  in  the  system  may  be  lost. 

The loss of notifications is fatal for MSHN because we are creating
an environment for mission-critical applications. The obvious
solution to this problem is to redefine the Event Service
specification to include persistency for the undelivered
notifications in the Event Channel. The OMG has been defining this
requirement in the Notification Service specification [7]. However,
no vendors had implemented this new specification at the time of
this research.

^Although there are many definitions of failure, we specifically
mean that if the Event Service does not fail, then all consumers
receive the correct value. This agrees with Lamport’s definition of
failure [15].

Figure 7: Available Services

Figure 8: Available Services (Continued)

Figure 9: Using Event Service in MSHN

Problems  with  Untyped  Event  Service.  The 
Untyped  Event  Service  does  not  specify  any  way  to  fil¬ 
ter  notifications.  Therefore  when  using  this  service,  all 
notifications  are  received  by  all  registered  consumers. 

Passing all of these notifications through the network, many of
which any particular consumer will discard, increases the network
load between the Event Channel and the consumers. Additionally, the
consumers must filter the events and convert the parameters of type
any to the expected type, which places an additional and unwanted
processing load on the consumers. Finally, as more suppliers, in
particular more applications, register with the Untyped Event
Channel, more events will be generated in the system. Since the
Untyped Event Channel delivers each event to all of the registered
consumers, and every consumer must filter every event, the network
load and the consumer load will increase rapidly.

To handle this problem, we can use Typed Event Channels, which
filter the notifications according to their type. With this
solution, the consumers receive only the notifications for which
they register, decreasing the network traffic. In this solution, one
Event Channel processes all of the notifications and delivers them
only to the corresponding consumers. This also lightens the loads on
the consumers because they avoid having to examine and discard
events not meant for them. However, we note that it increases the
computational load on the Event Channel. Later, we compare the
run-time performance of the Typed Event Channel to that of the
Untyped Event Channel using this approach in the MSHN architecture.

Figure 10: Using Untyped Event Service
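
The trade-off between broadcasting and type-based filtering can be
made concrete by counting deliveries. The sketch below is
illustrative only: the event type names and the two functions are
hypothetical, and a "delivery" stands for one message crossing the
network from the channel to a consumer.

```python
from collections import defaultdict


def untyped_deliveries(events, consumers):
    """Every event is sent to every consumer, which filters locally."""
    sent = 0
    accepted = defaultdict(list)
    for event_type, payload in events:
        for name, wanted in consumers.items():
            sent += 1                       # crosses the network regardless
            if event_type in wanted:        # consumer-side filtering
                accepted[name].append(payload)
    return sent, dict(accepted)


def typed_deliveries(events, consumers):
    """The channel filters by type and sends only to interested consumers."""
    sent = 0
    accepted = defaultdict(list)
    for event_type, payload in events:
        for name, wanted in consumers.items():
            if event_type in wanted:        # channel-side filtering
                sent += 1
                accepted[name].append(payload)
    return sent, dict(accepted)


# Hypothetical event types, chosen only for the demonstration.
consumers = {
    "SA": {"schedule_request"},
    "RRD": {"requirement_update"},
    "RSS": {"status_update"},
}
events = [
    ("status_update", "hostA load 0.4"),
    ("status_update", "hostB load 0.9"),
    ("requirement_update", "app1 needs 0.3"),
    ("schedule_request", "app1"),
]
untyped_sent, untyped_result = untyped_deliveries(events, consumers)
typed_sent, typed_result = typed_deliveries(events, consumers)
```

With four events and three consumers, the untyped variant sends
twelve messages and the typed variant sends four, while each
consumer ends up accepting the same events. The filtering work has
moved from the consumers to the channel.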

Alternatively,  since  we  only  have  five  different  types 
of  components  in  MSHN,  we  could  use  different  chan¬ 
nels  for  each  connection  between  these  components.  In 
this  approach,  each  Event  Channel  will  only  support 
one  notification  type.  For  example,  for  the  Client  Lib¬ 
rary  -  Scheduling  Advisor  Event  Channel,  we  will  have 
the  Client  Library  as  a  supplier,  the  Scheduling  Ad¬ 
visor  as  a  consumer,  and  the  possible  client  scheduling 
requests  as  the  types  of  the  notifications.  Each  MSHN 
component  may  be  replicated  by  registering  the  addi¬ 
tional  (identical)  components  to  the  same  Event  Chan¬ 
nel.  This  solution  is  shown  in  Figure  10. 

Obviously,  some  combination  of  these  two  solutions 
may  be  best.  That  is,  the  Typed  Event  Channel  itself 
can  become  a  bottleneck  in  the  first  solution.  There¬ 
fore,  replication  of  Typed  Event  Channels  may  better 
fit  MSHN’s  requirements.  In  this  paper,  we  focused  on 
the  careful  analysis  of  individual  solutions  rather  than 
empirically  exploring  the  exponentially  sized  solution 
space  that  combining  these  two  techniques  will  create. 

How to implement a component that is both a supplier and a consumer
in a system in order to minimize the run-time overhead. All
components of MSHN are both consumers and suppliers. Also, and
perhaps particular to MSHN, when a component receives a
notification, it usually becomes a supplier by generating another
notification and delivering it to the appropriate Event Channel.
Figure 11 shows the process of passing notifications from the Client
Library to the Scheduling Advisor using the push() operation. It
reveals how the Scheduling Advisor changes from a consumer to a
supplier. In the Untyped Event Service’s Push-Push Model, the
supplier (here the Client Library) invokes a default push()
operation on the Event Channel, which in turn invokes a push()
operation supplied by the developer of the consumer (here the
Scheduling Advisor). Inside the push() operation written for the
Scheduling Advisor (as a consumer), the developer invokes the
default push() operation on the Scheduling Advisor - Resource
Requirement Database (SA - RRD) Event Channel (which, of course,
invokes the push() operation supplied by the developer of the
Resource Requirement Database).


Figure 11: Using push() Operation


The design issue here is to determine how to supply the
Interoperable Object Reference (IOR) of the SA - RRD Event Channel
to the push() operation of the Scheduling Advisor. We want to avoid
using the Naming Service every time the push() operation (here the
push() operation of the Scheduling Advisor) is invoked. Instead, the
developer can locate the SA - RRD Event Channel in the servant
implementation. That is, the servant implementation will obtain the
IOR for the SA - RRD Event Channel, stringify the IOR, and store it
in a file. The push() operation implementation can retrieve these
IORs from their files, as needed, and deliver generated events,
thereby pushing the corresponding notifications to the channel.

Therefore, in the Untyped Event Service, to react to the
notification (here a request for a schedule) that the consumer
receives, the developer of the consumer (here the Scheduling
Advisor) must override the default push() operation between the
Event Channel and the consumer. For example, when the Scheduling
Advisor receives an event from the Client Library requesting a
schedule, it will generate a query notification for the Resource
Requirement Database and deliver it to the SA - RRD Event Channel.
In this case, the Scheduling Advisor becomes a supplier and is
required to locate the SA - RRD Event Channel. To avoid locating the
Event Channel to which the supplier will deliver the notification
via the Naming Service inside the push() operation, the developer
can locate the Event Channel in the servant implementation and
obtain the IORs for it. Then, the servant implementation can
stringify these IORs and store them in files.
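
The pattern of resolving once and reusing a stored, stringified
reference can be sketched as follows. This is not ORB code: the
`NamingService` class is a local stand-in, the reference string is a
placeholder rather than a real IOR, and the file name is
hypothetical.

```python
import os
import tempfile


class NamingService:
    """Stand-in for a naming service; counts how often it is consulted."""
    def __init__(self, bindings):
        self.bindings = bindings
        self.lookups = 0

    def resolve(self, name):
        self.lookups += 1
        return self.bindings[name]


def store_reference(naming, name, path):
    """Done once, when the servant is set up: resolve and save to a file."""
    with open(path, "w") as f:
        f.write(naming.resolve(name))


def load_reference(path):
    """Done inside push(): read the saved string instead of resolving."""
    with open(path) as f:
        return f.read()


# Placeholder string (assumption); a real stringified IOR is ORB-generated.
naming = NamingService({"SA-RRD-Channel": "IOR:placeholder-not-a-real-ior"})
path = os.path.join(tempfile.mkdtemp(), "sa_rrd_channel.ior")
store_reference(naming, "SA-RRD-Channel", path)

references = [load_reference(path) for _ in range(100)]  # 100 push() calls
```

The naming service is consulted once regardless of how many times
push() runs. The cost that remains is a local file read per
invocation.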


3.3 Remote Invocations

In  this  section,  we  discuss  using  remote  invocations 
to  coordinate  the  interactions  of  MSHN’s  components. 
Since  both  the  Static  Invocation  Interface  (SII)  and  the 
Dynamic  Invocation  Interface  (DII)  have  similar  remote 
invocation  mechanisms,  we  first  define  the  general  prob¬ 
lems  encountered  with  both,  and  then  enumerate  any 
additional  ones  that  are  specific  to  the  DII. 

The  same  functionality  described  above  using  the 
Event  Service  can  be  implemented  using  remote  invoc¬ 
ation.  The  most  important  difference  is  that  the  rep¬ 
lication  of  the  components  is  not  as  easy  as  it  is  using 
Event  Service.  To  support  replication  using  remote  in¬ 
vocation,  clients  must  make  multiple  invocations  rather 
than  just  the  one  needed  in  Event  Service. 

3.3.1 General Approach using Remote Invocation

Figure  12  shows  our  approach  that  uses  remote  invoc¬ 
ations  (i.e.,  either  the  Static  Invocation  Interface  (SII) 
or  the  Dynamic  Invocation  Interface  (DII))  to  establish 
inter-component  communication  in  the  MSHN  archi¬ 
tecture.  We  chose  from  two  communication  methods 
available  in  both  the  SII  and  DII:  one-way  invocation 
and  synchronous  invocation,  depending  upon  whether 
reliable  communication  is  required. 

When using the SII, a component requires compile-time knowledge of
the Interface Definition Language (IDL) interface of the target
component from which it will request a service. In contrast, the
same component, using the Event Service, makes its request via a
standard API that is independent of the target component and its
functionality. However, when using the DII, the components of MSHN
can invoke operations on other components without requiring
precompiled stubs. Thus, we may substitute different instantiations
of such components without requiring re-linking. Additionally, using
the DII allows us to invoke objects using deferred synchronous
invocation. Such invocation is not available from the SII within the
current CORBA 2.2 specification. With deferred synchronous
invocation, the clients may continue their computation instead of
waiting for the results of the previously invoked operations to be
delivered.

Figure 12: Using Remote Invocations in MSHN
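
The control flow of deferred synchronous invocation can be
illustrated with an analogy in plain Python. This sketch does not use
CORBA: a thread pool future plays the role of the pending request,
and `remote_operation` is a hypothetical stand-in for a remote call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def remote_operation(x):
    """Stand-in for a remote call that takes a while to answer."""
    time.sleep(0.05)
    return x * 2


with ThreadPoolExecutor(max_workers=1) as pool:
    # Synchronous style: the caller blocks until the reply arrives.
    sync_result = remote_operation(21)

    # Deferred synchronous style: send the request, keep computing,
    # and collect the reply only when it is needed.
    pending = pool.submit(remote_operation, 21)      # send the request
    local_work = sum(range(1000))                    # overlaps with the call
    deferred_result = pending.result()               # collect the response
```

Both styles return the same value. The difference is that in the
deferred case the caller performs `local_work` while the request is
outstanding.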

3.3.2  Problems  with  Using  the  Initial  Remote 
Invocation  Approach 

We  now  enumerate  some  problems  with  our  initial  re¬ 
mote  invocation  approach. 

Lack of a Standard Thread Mechanism. Our first design decision was
to implement the remote invocations with threads, i.e., handling
each invocation of a component using a different thread. Using
threads would avoid any data synchronization problems and support
fairness for each schedule request. However, the CORBA 2.2
specification does not define how the threads must be implemented.
Therefore, each vendor has come up with its own solution, leading to
applications that are non-portable. For example, if you use IONA’s
Orbix as your development environment, and IONA’s Filters to
implement your threads, you cannot use the same implementation on
Inprise’s VisiBroker because Inprise’s solution for handling threads
uses Interceptors.

We avoided vendor-specific extensions that fall outside the
specification when implementing our prototypes. Therefore, we were
unable to use threads for any of our prototypes, although the use of
threads would have improved the throughput of schedule requests.

Best-Effort  Semantics.  One-way  invocation  has 
best-effort  semantics.  Thus,  there  is  no  guarantee  that 
the  requested  method  is  actually  invoked.  In  this  mech¬ 
anism,  the  client  continues  its  processing  immediately 
after  initializing  the  request  and  never  synchronizes 
with  the  completion  of  the  request.  Hence,  one-way 
invocation  is  not  a  good  mechanism  for  most  of  the 
MSHN  system  because  it  is  not  reliable. 

However,  using  one-way  invocations  for  frequent 
short-term  updates  could  be  cost  effective  in  some  cases 
in  MSHN.  There  are  two  advantages  to  selectively  using 
best-effort  asynchronous  semantics  between  MSHN’s 
Client  Library  and  Resource  Status  Server.  First,  the 
Client  Library  can  continue  its  computation  immedi¬ 
ately  without  blocking.  Second,  we  expect  that  the  Re¬ 
source  Status  Server  will  be  updated  very  frequently. 
Therefore,  we  can  afford  the  delay  needed  to  get  the 
accurate  status  of  a  resource  with  the  next  update  in¬ 
stead  of  forcing  the  use  of  a  more  reliable  transmission 
mechanism. 
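
The argument that a lost update is soon corrected by the next one can
be shown with a short sketch. The names are hypothetical and the loss
pattern is fixed by hand for the demonstration; no claim is made
about actual loss rates.

```python
class StatusServer:
    def __init__(self):
        self.latest = None

    def update(self, value):
        self.latest = value


def send_oneway(server, value, delivered):
    """Best effort: the sender never learns whether the update arrived."""
    if delivered:
        server.update(value)


server = StatusServer()
readings = [0.30, 0.35, 0.50, 0.55, 0.60]
# Assumed loss pattern for the demonstration: the third update is lost.
arrived = [True, True, False, True, True]

stale_after = []
for value, ok in zip(readings, arrived):
    send_oneway(server, value, ok)
    stale_after.append(server.latest != value)
```

The server holds a stale value only between the lost update and the
following one, so the staleness is bounded by one update interval
when updates are frequent.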

3.3.3  Problems  with  Our  Initial  Approach  that 
are  Specific  to  using  DII 

We  now  enumerate  some  problems  with  our  initial  ap¬ 
proach  that  are  specific  to  using  DII. 

The Additional Overhead of the DII. A straightforward DII approach
requires 5-6 method invocations in order to invoke a single remote
method: looking up the interface name, getting the operation
identifier/parameters, and creating the request (which may also be
remote). This would add considerable run-time overhead, which would
be unacceptable in MSHN’s architecture.

In MSHN, however, we know the interface of the components, i.e., the
operation identifier, the parameters, and the return type, when we
are developing the client applications. Thus, we can obtain the
flexibility and benefits of the DII’s deferred synchronous
invocation without having to pay the overhead of querying the
Interface Repository for the interface information. We do note that
if a deferred synchronous invocation mechanism, such as Promises
[16], had been specified as part of CORBA’s static invocation
interface, the use of the DII would not be necessary in this case.
We compare the performance of the SII and DII in the results
section.

3.4 Using the Naming Service

We used the Naming Service to obtain object references in each of
our prototypes. For the static and dynamic invocation interfaces,
all components must resolve names only once, when they are
instantiated, to obtain IORs via the Naming Service. References
within all components, except the Client Library, are stored in
files for future use, as we described previously. The components do
not use the Naming Service unless the IORs that they have are no
longer valid. We use the exception handling mechanism in CORBA to
catch invalid IORs, and then use the Naming Service to obtain new
valid ones.
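
The fallback from a cached reference to the Naming Service can be
sketched as follows. Everything here is a local stand-in: the
exception class, the `invoke` function, and the reference strings are
hypothetical and only mimic the shape of the exception-driven retry
described above.

```python
class InvalidReference(Exception):
    """Stand-in for the exception an ORB raises for an unusable reference."""


class NamingService:
    def __init__(self):
        self.lookups = 0
        self.current = "ref-v1"

    def resolve(self, name):
        self.lookups += 1
        return self.current


def invoke(reference, valid_reference):
    """Pretend remote call: fails if the reference is out of date."""
    if reference != valid_reference:
        raise InvalidReference(reference)
    return "ok"


class Component:
    def __init__(self, naming, name):
        self.naming = naming
        self.name = name
        self.reference = naming.resolve(name)   # resolved once at start-up

    def call(self, valid_reference):
        try:
            return invoke(self.reference, valid_reference)
        except InvalidReference:
            # Fall back to the Naming Service only after a failure.
            self.reference = self.naming.resolve(self.name)
            return invoke(self.reference, valid_reference)


naming = NamingService()
component = Component(naming, "SchedulingAdvisor")
results = [component.call("ref-v1") for _ in range(3)]   # no extra lookups
naming.current = "ref-v2"                                # target restarted
results.append(component.call("ref-v2"))                 # one re-resolution
```

Four calls cost two lookups in total: one at start-up and one after
the reference became invalid.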

To  improve  the  run-time  performance  of  the  Event 
Service  implementations,  we  registered  each  compon¬ 
ent  with  the  appropriate  Event  Channel.  We  resolve 
the  Event  Channel  references  using  the  Naming  Service. 
Then  we  query  the  Event  Channels  to  obtain  the  refer¬ 
ences  for  the  Proxy  Push  Suppliers,  stringify  them,  and 
then  store  them  in  files.  When  a  component  receives 
an  event,  and  generates  another  event  in  response  to 
the  one  it  received,  that  component  reads  the  appropri¬ 
ate  file  to  obtain  the  stringified  reference  and  uses  this 
reference  to  push  the  event  to  the  corresponding  Event 
Channel. 

4  Quantitative  Results 

We  described  our  design  decisions  for  implementing 
our  prototypes  in  the  previous  section.  In  this  section, 
we  discuss  the  performance  results  of  these  different 
prototypes.  First,  we  describe  our  test  bed.  Then  we 
explain  our  tests  and  enumerate  their  results. 

4.1  Hardware  and  Software  Used  in  the 
Test  Bed 

As discussed earlier, at the beginning of this research we surveyed
the available implementations of CORBA to determine what services
were supported (see Figures 7 and 8). Based upon the robustness and
availability of services, particularly the Typed Event Service, we
chose IONA Technologies’ CORBA implementation, specifically OrbixMT
2.3c, OrbixNames 1.1c, OrbixEvent 1.0c (Untyped Event Service) and
OrbixEvent 1.0b (Typed Event Service), built using the SunSparc C++
Compiler 4.1.

We  ran  our  tests  on  SunSparc  Station  10  hosts  with 
300MHz  CPUs  and  128  MB  of  RAM  each,  running  the 
Solaris  2.6  operating  system.  The  hosts  were  connected 
via  a  100  Mbits/sec  Ethernet  LAN. 

To obtain correct results in the tests utilizing the network, we
used the Network Time Protocol to synchronize the system clocks of
the hosts. We found that the system clock on the SunSparc 10 has a
skew of approximately 3 milliseconds every 15 minutes. Therefore, in
order to minimize the difference between the various system clocks,
we synchronized the clocks every 5 minutes and ran the tests
immediately after the synchronization.
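
The effect of the 5-minute resynchronization interval can be checked
with a short calculation. It assumes that the reported skew
accumulates linearly, which is a simplification.

```python
DRIFT_MS = 3.0           # observed skew reported above
DRIFT_WINDOW_MIN = 15.0  # interval over which that skew accumulates
SYNC_INTERVAL_MIN = 5.0  # resynchronization interval used in the tests

# Assumption: the skew grows linearly between synchronizations.
drift_rate = DRIFT_MS / DRIFT_WINDOW_MIN             # ms per minute
max_drift_between_syncs = drift_rate * SYNC_INTERVAL_MIN
```

Under that assumption a host's clock drifts by about 0.2 ms per
minute, so it is off by at most about 1 ms just before the next
synchronization, and by less for tests run immediately afterwards.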

4.2  Experiments 

We  determined  the  overhead  of  each  CORBA  mech¬ 
anism  on  a  single  machine,  and  then  measured  the  re¬ 
sponse  times  over  the  network  of  the  various  mechan¬ 
isms,  that  is,  the  total  time  required  to  service  1000 
scheduling  requests.  This  interval  begins  when  the  Cli¬ 
ent  Library  requests  a  schedule  from  the  Scheduling 
Advisor  and  includes  all  processing  up  until  the  time 
that  the  Client  Library  receives  a  response.  This  dura¬ 
tion  includes  the  time  spent  querying  the  Resource  Re¬ 
quirement  Database  and  the  Resource  Status  Server. 
At the time of this testing, we did not have a fully functional
Scheduling Advisor, so we emulated its execution by having the
thread that was computing a schedule pause for 0.5 seconds. We chose
this duration based upon the average execution time of a set of 11
scheduling algorithms proposed for MSHN’s repertoire by Siegel [17].

To  assess  the  overhead  of  CORBA,  we  included  one 
non-CORBA  test.  This  base  case  consists  of  an  ap¬ 
plication  linked  with  all  the  MSHN  components  and 
executing  as  a  single  process  on  a  single  host.  This 
non-CORBA  test  uses  local  method  invocation  to  per¬ 
form  MSHN  component  intercommunication.  In  order 
to  assess  CORBA’s  overhead,  we  performed  two  sets 
of  tests.  In  the  first  set,  we  compared  this  base  case 
against  test  cases  where  we  ran  all  the  MSHN  compon¬ 
ents  on  the  same  machine  and  had  them  communicate 
via  CORBA  mechanisms.  In  the  second,  we  compared 
the  latter  tests  against  ones  where  the  MSHN  compon¬ 
ents  are  distributed  across  different  machines. 

With  the  exception  of  the  non-CORBA  base  case, 
we  ran  all  tests  both  on  a  single  machine  and  over  the 
network  using  different  workstations  to  execute  each 
of  the  Client  Library,  the  Resource  Status  Server,  the 
Resource  Requirements  Database  and  the  Scheduling 
Advisor. 

All  single  machine  CORBA  tests  were  executed  us¬ 
ing  four  different  processes.  The  non-CORBA  single 
machine  tests  executed  completely  in  a  single  process, 
with  all  MSHN  calls  being  implemented  as  ordinary 
C++  function  calls.  In  implementing  both  static  in¬ 
vocation  and  dynamic  invocation  for  a  single  machine, 
we  used  synchronous  semantics. 

The average arrival rate of schedule requests
varies with the facility and time of day. Therefore, we
ran all of our tests for two different circumstances. In
the first, the mean inter-arrival time of the requests is
greater than the service time, i.e., on average each request
is completed by the system before the next request arrives.
The second represents the situation that exists
in the middle of a burst. In this case, the mean inter-arrival
time of the requests is less than the service time, i.e.,
some requests must be queued to be handled later. The
first  case  is  important  in  determining  performance  un¬ 
der  normal  conditions,  but  it  is  equally  important  for 
us  to  determine  that  the  system  neither  (1)  fails  com¬ 
pletely  when  heavily  loaded,  nor  (2)  incurs  overhead 
that  varies  exponentially  with  the  number  of  requests 
pending.  Indeed,  no  typed  event  service  that  we  have 
tested  to  date  could  pass  the  above  stress  tests. 

Unfortunately,  the  system  clocks  had  insufficient 
granularity  to  measure  precisely  the  total  time  to  pro¬ 
cess  a  single  request  in  our  non-CORBA  implementa¬ 
tion.  We  therefore  first  read  the  system  clock.  We  then 
generate  a  request  and  await  its  response,  repeating 
this  1000  times.  Lastly,  we  read  the  clock  again,  and 
determine the total time (for 1000 consecutive request-response
pairs). Because requests are generated consecutively,
and because each request uses synchronous
semantics to make the invocations, we call this set of
tests the consecutive synchronous tests.
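
The measurement procedure described above can be sketched as follows. This is a minimal illustration in Python rather than the C++ used for MSHN; `time_consecutive` and `fake_request` are hypothetical names, and `fake_request` merely stands in for a blocking schedule request.

```python
import time

def time_consecutive(send_request, count=1000):
    """Time `count` back-to-back blocking requests with two clock reads."""
    start = time.perf_counter()          # first clock read
    for _ in range(count):
        send_request()                   # returns only after the response arrives
    return time.perf_counter() - start   # second clock read

calls = []

def fake_request():
    # Hypothetical stand-in for a synchronous schedule request.
    calls.append(1)

total_seconds = time_consecutive(fake_request, count=1000)
mean_seconds = total_seconds / 1000      # average cost of one request-response pair
```

Timing the whole batch and dividing by the count sidesteps the clock granularity problem, at the price of losing per-request variation.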

To  simulate  the  case  where  many  requests  occur 
within  a  short  time  frame,  we  generated  requests  every 
.06  seconds,  on  average,  in  our  base  case.  For  this  set 
of  tests,  we  used  asynchronous  calls  within  the  applic¬ 
ation  to  start  the  schedule  request  chain  in  the  DII  and 
SII  implementations.  Event  Service  is  meant  to  be  used 
asynchronously,  so  there  was  no  special  programming 
required  to  implement  these  cases.  We  call  this  set 
the  bursty  asynchronous  tests  because  during  such 
a  burst,  the  requests  arrive  faster  than  the  expected 
required  service  time  and  queue  up  for  the  Scheduling 
Advisor. 
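
To see why requests queue up in this regime, consider a simplified model (an assumption for illustration, not the MSHN test harness): a single first-come first-served server, requests arriving exactly every 0.06 seconds, and a fixed 0.5 second service time. The function name `simulate_fifo` is hypothetical.

```python
def simulate_fifo(n_requests, inter_arrival, service_time):
    """Return (completion time of the last request, per-request queueing delays)."""
    finish = 0.0
    waits = []
    for i in range(n_requests):
        arrival = i * inter_arrival
        start = max(arrival, finish)   # wait if the server is still busy
        waits.append(start - arrival)
        finish = start + service_time
    return finish, waits

makespan, waits = simulate_fifo(1000, 0.06, 0.5)
```

Under these assumptions the server never idles after the first arrival, so the total time is governed by the service time alone while the queueing delay of later requests keeps growing.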

For  another  of  our  projects,  Schnaidt  and  Duman 
implemented  a  fully  optimized  version  of  an  applica¬ 
tion  using  sockets  and  compared  it  to  an  equivalent 
CORBA implementation to determine CORBA’s overheads
when running over the network [18]. We therefore
did not build a socket implementation
of MSHN. In the following paragraphs, we draw some
conclusions  based  both  on  the  Schnaidt-Duman  exper¬ 
iments  and  those  reported  here. 

4.3  Results 

We  summarize  our  quantitative  results  in  Figure  13. 
The  times  shown  are  the  actual  execution  times,  in 
seconds,  for  1000  requests.  We  have  included  a  schedul¬ 
ing  time  of  .5  seconds  per  request  and  have  not  simu¬ 
lated  the  execution  time  of  the  application. 

In order to fully understand these results, we must
first explain some anomalies that we observed in the
Unix calls we used to emulate the Scheduling Advisor
(select()) and the request generation inter-arrival
rate (ualarm()). The average of the actual select()
times was 125 microseconds more than the requested
.5 seconds. We also observed an average of 10 milliseconds
error for the ualarm() requests of 60 milliseconds.

Config.          Communication Mechanism   Local         Network
---------------  ------------------------  ------------  ------------
Consec. Synch.   Non-CORBA                 500.1         N/A
                 SII                       511.4         520.0
                 DII                       [illegible]   530.4
                 Untyped Event             593.9 (only one value legible; column unclear)
                 Typed Event               580.5         779.2
Bursty Asynch.   Non-CORBA                 N/A (only one value legible; column unclear)
                 SII                       [illegible]   510.8
                 DII                       521.2         520.2
                 Untyped Event             592.8         564.4
                 Typed Event               64.7          63.6
                 (for 100 requests)

Figure 13: Results of the Generic Experiments for 1000 requests

As  expected,  there  is  significant  overhead  in  using 
CORBA  for  communication,  and  therefore  across  more 
than  one  address  space,  as  compared  to  local  invoc¬ 
ations  within  a  single  address  space.  In  our  earlier 
project,  we  noted  similar  results  as  well  as  substantial 
overhead  when  an  optimized  non-CORBA  local  socket 
implementation  was  compared  to  a  local  CORBA  im¬ 
plementation  [18].  The  efficiency  of  the  socket  imple¬ 
mentation  on  a  single  machine  is  due  to  its  use  of  shared 
memory.  However,  even  if  a  CORBA  implementation 
used  shared  memory,  comparable  performance  would 
not  be  obtained.  Unfortunately,  the  CORBA  specifica¬ 
tion  requires  all  parameters  of  a  request  to  be  conver¬ 
ted  to  an  external,  machine  independent  data  repres¬ 
entation,  even  if  the  target  object  resides  on  the  same 
machine.  Also,  in  that  earlier  project,  we  noted  that 
a  networked  CORBA  implementation,  which  required 
less  than  5%  of  the  time  to  implement  as  compared 
to  the  socket  implementation,  had  only  20%  more  run¬ 
time  overhead.  Since  our  results  are  comparable  here, 
and  because  we  did  not  implement  a  highly  optim¬ 
ized  MSHN  socket  implementation,  we  will  limit  the 
remainder  of  our  remarks  to  comparing  the  perform¬ 
ance  of  various  CORBA  implementations  of  MSHN. 
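
As a rough worked example, the legible consecutive synchronous entries of Figure 13 can be converted into a per-request overhead relative to the non-CORBA base case. The helper name `overhead_ms` is hypothetical, and the calculation assumes the overhead is spread evenly over the 1000 requests.

```python
BASELINE_SECONDS = 500.1   # non-CORBA, local, consecutive synchronous (Figure 13)
REQUESTS = 1000

# Legible consecutive synchronous totals from Figure 13, in seconds.
totals = {
    ("SII", "local"): 511.4,
    ("SII", "network"): 520.0,
    ("DII", "network"): 530.4,
    ("Typed Event", "local"): 580.5,
    ("Typed Event", "network"): 779.2,
}

def overhead_ms(total_seconds):
    """Average extra milliseconds per request compared with the base case."""
    return (total_seconds - BASELINE_SECONDS) / REQUESTS * 1000.0

per_request = {key: overhead_ms(value) for key, value in totals.items()}
```

On these figures, local static invocation costs roughly 11 ms more per request than the single-process base case.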

Static  invocation  is  generally  the  fastest  intercom¬ 
munication  mechanism  available  in  CORBA  [1].  Even 
though  dynamic  invocation  is  generally  much  slower,  we 
see  that  the  performance  of  dynamic  invocation,  when 
we  know  the  interfaces  at  development  time,  is  close 
to that of static invocation. However, we note that the
most efficient implementation would likely be available
from a deferred synchronous Static Invocation Interface.
We recommend that such semantics, similar to
those in Promises [16], be considered for adoption into
the CORBA specification.

The  comparison  between  the  consecutive  synchron¬ 
ous  and  bursty  asynchronous  tests  seems  surprising  at 
first  glance.  One  would  normally  expect  that  a  system 
loaded  with  bursty  requests  would  not  perform  better 
than  an  unloaded  system.  To  understand  the  reason  for 
this  performance  improvement,  we  must  further  elabor¬ 
ate  on  the  client  application’s  use  of  the  Naming  Ser¬ 
vice.  In  the  consecutive  case,  the  Client  Library  obtains 
the  reference  of  the  Scheduling  Advisor  from  the  Nam¬ 
ing  Service  immediately  prior  to  making  each  request. 
However,  in  the  bursty  asynchronous  case,  the  Client 
Library  obtains  all  of  the  references  asynchronously. 
Thus  in  the  bursty  asynchronous  case,  obtaining  these 
references  overlaps  with  the  actual  computation.  Un¬ 
fortunately,  we  will  only  expect  to  see  this  improvement 
in  the  actual  MSHN  implementation  if  the  Scheduling 
Advisor  is  executing  on  a  dual  processor  machine.  In 
our  experiments,  the  emulated  Scheduling  Advisor  is 
actually  blocked  while  the  Naming  Service  is  resolving 
addresses. 

In  the  4-machine  network  tests,  the  number  of  con¬ 
text  switches  required  between  MSHN’s  components 
and  the  Object  Request  Broker  is  substantially  re¬ 
duced.  Multiple  components  actually  execute  simul¬ 
taneously,  and  thus  run-times  were  smaller. 

As  seen  in  Figure  13,  the  Untyped  Event  Service 
adds  more  overhead  than  either  static  or  dynamic  invoc¬ 
ation  because  the  Event  Service  process  is  the  bottle¬ 
neck  in  the  system.  Of  course  in  an  overall  evaluation, 
this  additional  overhead  must  be  balanced  against  the 
reduced  cost  with  which  information  can  be  delivered 
to  replicated  system  components. 

In  addition  to  the  tests  described  above,  we  replic¬ 
ated  the  Untyped  Event  Service  to  see  whether  any 
speedup  could  be  obtained  by  distributing  the  load  of 
the Event Service process. First we created two Event
Service processes, one on the same host as the application
and the other on the same host as the Scheduling
Advisor, in an attempt to achieve some speedup.
This  approach  performed  worse  than  the  single  Event 
Service  process.  Upon  analysis,  we  determined  that 
it  introduced  unnecessary  network  communication  and 
placed  the  Event  Service  processes  on  the  busiest  hosts. 
Then  we  moved  the  Event  Service  processes  to  the  same 
hosts  as  the  Resource  Requirements  Database  and  the 
Resource  Status  Server.  Figure  15  shows  the  speedup 
we  observed  with  this  configuration.  We  also  ran  tests 
using  four  distributed  Event  Service  processes.  Un¬ 
fortunately,  probably  because  of  the  excessive  amount 
of  communication,  this  approach  performed  no  better 
than  using  a  single  Event  Service  process. 

In  MSHN’s  Typed  Event  Service  implementation,  all 
of  the  communication  passes  through  a  single  process. 
The  CORBA  implementations  that  we  used^  failed  in 
this  bursty  asynchronous  case.  In  Figure  13,  we  in¬ 
clude  the  time  required  to  process  100  requests  for  the 
bursty  asynchronous  case.  Since  the  current  imple¬ 
mentations  of  Typed  Event  Service  do  not  allow  replic¬ 
ation,  we  could  not  run  a  replicated  test  with  the  Typed 
Event  Service  as  we  did  with  the  Untyped  Event  Ser¬ 
vice.  Hence,  we  believe  that  the  Typed  Event  Service 
is  not  ready  to  be  used  in  middleware  to  support  het¬ 
erogeneous  distributed  computing. 

5  Conclusions 

In  this  paper,  we  described  our  experiences  using 
mechanisms  of  the  CORBA  2.2  specification  to  facilit¬ 
ate  communication  in  a  resource  management  system 
that  is  both  designed  to  manage  distributed  hetero¬ 
geneous  applications,  and  is  itself  distributed  and  het¬ 
erogeneous.  In  our  qualitative  assessment  of  CORBA 
2.2, we found several minor problems and recommended
the addition of deferred synchronous semantics
to CORBA’s Static Invocation Interface. We found
that both CORBA’s static invocation and dynamic
invocation,  when  used  solely  to  obtain  asynchronous 
semantics,  were  efficient  enough  to  support  distrib¬ 
uted  heterogeneous  resource  management  systems.  We 
found  that  substantial  work  is  needed  to  provide  imple¬ 
mentations  of  Typed  Event  Services  that  can  handle 
the  loads  placed  on  them  when  requests  occur  in  a 
bursty  fashion.  We  also  determined  that  while  Untyped 
Event  Services  add  substantial  overhead  as  compared 
to  static  invocation,  they  may  still  be  desirable  in  the 
case  where  multicast  of  requests  is  desired,  particularly 
if  they  are  replicated  and  themselves  wisely  allocated 
to machines in the system. In summary, many of the
existing CORBA services can be quite useful in implementing
resource management systems for heterogeneous
computing, and other CORBA services hold
substantial promise for the future.

^ Typed Event Service is new in the CORBA 2.2 specification
and not many CORBA products have this service available as
yet.

Abstract

In  this  paper,  a  method  for  estimating  task  execu¬ 
tion  times  is  presented,  in  order  to  facilitate  dynamic 
scheduling in a heterogeneous metacomputing environment.
Execution time is treated as a random variable
and  is  statistically  estimated  from  past  observations. 
This  method  predicts  the  execution  time  as  a  function  of 
several  parameters  of  the  input  data,  and  does  not  re¬ 
quire  any  direct  information  about  the  algorithms  used 
by  the  tasks  or  the  architecture  of  the  machines.  Tech¬ 
niques  based  upon  the  concept  of  analytic  benchmark¬ 
ing/code  profiling  [7]  are  used  to  accurately  determine 
the  performance  differences  between  machines,  allow¬ 
ing  observations  to  be  shared  between  machines.  Ex¬ 
perimental  results  using  real  data  are  presented. 
Keywords:  Heterogeneous  Distributed  Computing,  Ex¬ 
ecution  Time  Estimation,  Nonparametric  Regression, 
Analytic  Benchmarking,  Distance  Matrices. 


1  Introduction 

Heterogeneous  metacomputing  is  a  type  of  paral¬ 
lel  computing,  where  a  large,  distributed  network  of 
heterogeneous  machines  is  used  as  a  single  computa¬ 
tional  entity.  Applications  executing  in  this  environ¬ 
ment  consist  of  a  set  of  coarse-grained,  precedence- 
constrained  tasks,  where  the  precedence  structure  can 
be  represented  using  a  directed  acyclic  graph  (DAG). 
The  performance  of  an  application  in  this  environment 
is  largely  determined  by  the  manner  in  which  these 
tasks  are  assigned  to  the  machines;  the  construction  of 
such an assignment is called the matching and scheduling
problem. Matching and scheduling algorithms need
to know the execution time of each task on each machine
to perform well, and most matching and scheduling
algorithms for DAGs in the literature assume that
the execution time of a given task is a known quantity.
However,  the  execution  time  of  a  task  on  a  given  ma¬ 
chine  depends  upon  many  factors,  including  the  prob¬ 
lem  size  and  the  input  data,  and  is  not  trivial  to  de¬ 
termine  a  priori.  In  a  heterogeneous  environment,  the 
wide  variety  of  machine  architectures  further  compli¬ 
cates  the  process  of  determining  the  execution  time, 
since  the  execution  time  is  also  machine  dependent. 
Methods  are  clearly  needed  which  can  accurately  pre¬ 
dict  the  execution  time  of  a  task  on  a  variety  of  ma¬ 
chines  as  a  function  of  the  features  of  the  data  set.  This 
problem  is  called  the  execution  time  estimation  prob¬ 
lem. 

In  the  literature,  there  are  three  major  classes  of  so¬ 
lutions  to  the  execution  time  estimation  problem:  code 
analysis [18], analytic benchmarking/code profiling [7,
11, 12, 16, 20, 22, 23] and statistical prediction [3, 10,
13]. In code analysis, an execution time estimate is
found  through  analysis  of  the  source  code  of  the  task. 
A  given  code  analysis  technique  is  typically  limited  to 
a  specific  code  type  or  a  limited  class  of  architectures. 
Thus,  these  methods  are  not  very  applicable  to  a  broad 
definition  of  heterogeneous  computing,  and  will  not  be 
examined  here. 

A class of methods that is more useful in a
heterogeneous metacomputing environment is analytic
benchmarking/code profiling. Analytic benchmarking/code
profiling was first presented by Freund [7], and
has  been  extended  by  Pease  et  al.  [16],  Yang  et  al.  [22, 
23],  Khokhar  et  al.  [11, 12],  and  Siegel  [20].  Analytic 
benchmarking  defines  a  number  of  primitive  code  types. 
On  each  machine,  benchmarks  are  obtained  which  de¬ 
termine  the  performance  of  the  machine  for  each  code 
type.  Code  profiling  attempts  to  determine  the  compo¬ 
sition  of  a  task,  in  terms  of  the  same  code  types.  The 
analytic  benchmarking  data  and  the  code  profiling  data 
are  then  combined  to  produce  an  execution  time  esti¬ 
mate.  Analytic  benchmarking/code  profiling  has  two 
disadvantages. First, it lacks a proven mechanism for
producing  an  execution  time  estimate  from  the  bench¬ 
marking  and  profiling  data  over  a  wide  range  of  algo¬ 
rithms  and  architectures.  Second,  it  cannot  easily  com¬ 
pensate  for  variations  in  the  input  data  set.  However, 
analytic  benchmarking  is  a  powerful  comparative  tool 
in  that  it  can  determine  the  relative  performance  differ¬ 
ences  between  machines. 

The third class of execution time estimation algorithms,
statistical prediction algorithms, makes predictions
using past observations. A set of past observations
is kept for each machine and is used to make new
execution  time  predictions.  The  matching  and  schedul¬ 
ing  algorithm  uses  these  predictions  (and  other  infor¬ 
mation)  to  choose  a  machine  to  execute  the  task.  While 
the  task  executes  on  the  chosen  machine,  the  execu¬ 
tion  time  is  measured,  and  this  measurement  is  subse¬ 
quently  added  to  the  set  of  previous  observations.  Thus, 
as  the  number  of  observations  increases,  the  estimates 
produced  by  a  statistical  algorithm  will  improve.  Statis¬ 
tical  prediction  algorithms  have  been  presented  by  Iver¬ 
son  et  al.  [10],  Kidd  et  al.  [13],  and  Devarakonda  and 
Iyer  [3].  Statistical  methods  have  the  advantages  that 
they  are  able  to  compensate  for  parameters  of  the  input 
data  (such  as  the  problem  size)  and  do  not  need  any  di¬ 
rect  knowledge  of  the  internal  design  of  the  algorithm 
or  the  machine.  However,  statistical  techniques  lack  an 
intrinsic  method  of  sharing  observations  between  ma¬ 
chines.  By  allowing  observations  to  be  shared  between 
machines,  the  execution  time  estimate  on  a  machine 
with  few  observations  can  be  improved  by  using  obser¬ 
vations  from  machines  with  similar  performance  char¬ 
acteristics. 

Given  the  advantages  and  disadvantages  of  both  ana¬ 
lytic  benchmarking/code  profiling  and  statistical  meth¬ 
ods,  this  paper  presents  a  hybrid  method,  which  uses  an¬ 
alytic  benchmarking  techniques  to  create  a  unified  set  of 
observations  describing  both  the  input  data  features  and 
the  machine  capabilities.  This  unified  observation  space 
is  then  used  by  a  statistical  method  to  produce  execution 
time  estimates.  In  this  paper,  as  in  much  of  the  DAG 
scheduling  literature,  each  task  is  assumed  to  have  ex¬ 
clusive  use  of  the  machine  on  which  it  executes.  Thus, 
the  execution  time  of  a  task  is  not  a  function  of  the  other 
tasks  in  the  system,  and  is  only  a  function  of  the  ma¬ 
chine  capabilities  and  input  data.  This  method  models 
the  execution  time  of  a  task  as  a  random  variable,  allow¬ 
ing  the  matching  and  scheduling  algorithm  to  consider 
the  uncertainty  present  in  the  execution  time  estimate. 
(Several  papers  have  discussed  the  idea  of  scheduling 
with  random  quantities,  including  King  [14],  Tan  and 
Siegel  [21],  Armstrong  [2],  Li  et  al.  [15]  and  Hou  and 
Shin  [9].) 

The remainder of this paper is organized as follows.
First, the stochastic model of the execution time of a
task  is  presented  in  Section  2.  Section  3  presents  the 
prediction  algorithm  which  uses  this  model.  Section  4 
presents  the  results  of  simulations  using  real  data,  and 
conclusions  are  offered  in  Section  5. 

2  Modeling  the  Execution  Time  as  a  Ran¬ 
dom  Variable 

The  execution  time  of  a  task  on  a  given  machine 
largely  depends  on  the  size  and  properties  of  the  in¬ 
put  data  set.  For  example,  the  execution  time  of  many 
matrix  algorithms  depends  upon  the  size  of  the  matrix. 
Furthermore, if the matrix algorithm is iterative in nature,
the execution time may also depend upon the condition
of the matrix and the desired precision of the results.
In principle, it is possible to quantify these properties
of the input data set as a vector of numeric parameters
$X = [x_1\ x_2\ \cdots\ x_q]$. Thus, the execution time
of the task can be modeled as a function $t = m(X)$ of
this parameter vector. However, in many instances, it
is not computationally practical to determine all $q$ parameters.
Therefore, it is assumed that only a limited
number $p < q$ of these parameters will be explicitly
modeled. Thus, the parameter vector $X = [x_1\ x_2\ \cdots\ x_p]$
will be used to model the execution time of the task.
However, the presence of unmodeled factors will cause
a certain amount of error to be present in an estimate of
the execution time. To compensate for this error, the execution
time of a task is modeled as a random variable $t$.
This random quantity can be represented as:

$$t = m(X) + \epsilon, \tag{1}$$

where $m(X)$ is deterministic, and $\epsilon$ is purely stochastic.
In this equation, $\epsilon$ represents the unmodeled factors
affecting the execution time, while $m(X)$ represents
the modeled factors, and therefore depends upon a
$p$-dimensional vector of parameters $X = [x_1\ x_2\ \cdots\ x_p]$.
In essence, $m(X)$ represents the mean of $t$ given $X$,
while $\epsilon$ represents the zero-mean random error present
in the estimate.

While the unmodeled factors which affect $\epsilon$ are unknown,
it is possible to determine their effect upon
the execution time indirectly by estimating properties
of the random variable $\epsilon$. Additionally, while $\epsilon$ does
not directly depend upon the parameter vector
$X = [x_1\ x_2\ \cdots\ x_p]$, in practice, $\epsilon$ does display some
dependence on the modeled parameters, due to the fact that
the modeled and unmodeled parameters may not be statistically
independent. Thus, some degree of correlation
may exist between these sets.

Given this model, the goal of the execution time estimation
problem is to obtain estimates of $m(X)$ and $\epsilon$
for some given parameter vector $X$. Before presenting
the specific details of how these values are estimated,
examples of how two real algorithms behave under this
model will be given to illustrate these concepts.
The first example shows an algorithm
where the set of unmodeled parameters has a very small
effect upon the execution time (i.e., $\epsilon$ is small). Figure 1
shows the execution time of the Cholesky matrix
decomposition algorithm for various problem sizes on
a single machine. The relative continuity of this curve
shows that $\epsilon$ has a very small impact on the execution
time.
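
The model in equation 1 can be illustrated with synthetic data for a Cholesky-like task. The sketch below is not the measured data behind Figure 1: the per-operation cost, the noise level, and the name `simulate_times` are assumptions, and only the roughly $n^3/3$ operation count of Cholesky decomposition is a known property.

```python
import random

def simulate_times(sizes, cost_per_op=1e-9, noise_sd=1e-4, seed=0):
    """Generate (n, m(X), t) triples with t = m(X) + epsilon."""
    rng = random.Random(seed)
    samples = []
    for n in sizes:
        mean_time = cost_per_op * n ** 3 / 3.0   # modeled part m(X), X = [n]
        eps = rng.gauss(0.0, noise_sd)           # unmodeled part, zero mean
        samples.append((n, mean_time, mean_time + eps))
    return samples

samples = simulate_times(range(100, 1100, 100))
residuals = [t - m for _, m, t in samples]       # realizations of epsilon
```

When the noise is small relative to $m(X)$, as assumed here, the observed times trace a smooth curve in the problem size.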

The  opposite  case  is  illustrated  in  Figure  2.  This 
algorithm  attempts  to  determine  if  a  given  number  is 
prime  through  the  process  of  trial  division.  The  exe¬ 
cution  time  of  this  algorithm,  for  a  given  number  n,  is 
essentially  a  function  of  the  smallest  prime  factor  of  the 
number  n.  However,  it  is  not  practical  to  compute  the 
smallest  prime  factor  of  the  number  n,  since  the  compu¬ 
tational  cost  of  this  problem  is  equivalent  to  determin¬ 
ing  if  the  number  is  prime.  However,  it  is  possible  to  use 
the  magnitude  of  the  number  as  a  parameter,  since  this 
value  bounds  the  magnitude  of  the  smallest  prime  fac¬ 
tor.  Thus  there  is  a  loose  correlation  between  the  mag¬ 
nitude  of  the  number  and  the  execution  time,  which  can 
be  seen  in  the  figure.  Thus,  even  in  this  extreme  exam¬ 
ple,  it  is  still  possible  to  obtain  some  information  about 
the  execution  time  of  the  task  which  can  be  used  by 
the matching and scheduling algorithm. In the next section,
the techniques used to estimate the values of $m(X)$
and $\epsilon$ will be presented.
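
The trial division example can be made concrete with a short sketch that counts divisions as a proxy for execution time. The function name `trial_division_steps` is hypothetical, and the exact loop used for Figure 2 is not given in the paper.

```python
def trial_division_steps(n):
    """Return (is_prime, number of trial divisions) for an integer n >= 2."""
    steps = 0
    d = 2
    while d * d <= n:
        steps += 1
        if n % d == 0:
            return False, steps   # stops at the smallest prime factor
        d += 1
    return True, steps            # no divisor up to sqrt(n): n is prime

even_case = trial_division_steps(10004)       # smallest factor is 2
composite_case = trial_division_steps(10001)  # 10001 = 73 * 137
prime_case = trial_division_steps(10007)      # prime, so the loop runs to sqrt(n)
```

Three inputs of nearly equal magnitude need very different amounts of work, which is the large $\epsilon$ behavior described above; the magnitude only bounds the cost.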


3  Execution  Time  Estimation  Algorithm 


The  algorithm  developed  in  this  paper  is  presented  in 
two  sections.  Section  3.1  poses  the  execution  time  esti¬ 
mation  problem  as  a  regression  problem,  and  presents 
a  k-Nearest  Neighbor  (k-NN)  regression  algorithm  to 
compute  estimates  from  a  set  of  previous  observations. 
For  clarity  of  presentation,  the  regression  algorithm  is 
described using only the vector of parameters describing
the task input data. While this algorithm can compute
the execution time of a task on a single machine as
a function of the input data set, it lacks an intrinsic ability
to  share  observations  between  dissimilar  machines.  To 
eliminate  this  restriction,  the  regression  vector  is  aug¬ 
mented  in  Section  3.2  to  include  a  parameterization  of 
different  machines.  Thus,  the  execution  time  may  be 
estimated  using  previous  observations  as  a  function  of 
both  machine  type  and  task  characteristics. 


3.1  Nonparametric  Regression 

Given that the execution time of a task is modeled
as a random variable (as in equation 1), the goal of this
paper is to present methods to obtain estimates $\hat{m}(X)$
and $\hat{\epsilon}$ of $m(X)$ and $\epsilon$ for a given parameter vector $X$
which characterizes the input data set. This will be accomplished
through the use of a set of $n$ previous observations
of the execution time $\{(t_i, X_i)\}_{i=1}^{n}$, where $t_i$
is an observed execution time for the parameter vector
$X_i$. The parameter vectors $X_i$ of the $n$ previous
observations are samples in the parameter space ($p$-dimensional
real vectors). In statistics, this problem is
called a regression problem. Note that, as presented
in this section, each machine requires a separate set of
observations. This restriction will be relaxed in Section 3.2.

There  are  a  variety  of  different  techniques  to  solve 
regression  problems,  which  can  be  divided  into  two 
classes:  parametric  techniques  and  nonparametric 
techniques.  In  general,  parametric  techniques  require 
knowledge of the functional form of $m(X)$ and $\epsilon$.
Since,  in  this  paper,  it  is  difficult  to  make  any  as¬ 
sumptions  about  the  functional  form  of  the  model  with¬ 
out  specific  knowledge  of  the  task  and  the  machine 
in  question,  parametric  techniques  are  not  well  suited 
to  this  problem  [10].  Nonparametric  regression  tech¬ 
niques  (also  called  nonparametric  estimators  or  smooth¬ 
ing  techniques)  are  considered  to  be  data  driven,  since 
the estimate depends only upon the set of previous observations,
and not on any assumptions about $m(X)$
or $\epsilon$. Therefore, nonparametric techniques will be used
in  this  paper. 

All nonparametric regression techniques compute
$\hat{m}(X)$ using a variation of the equation

$$\hat{m}(X) = \frac{1}{n} \sum_{i=1}^{n} W_i(X)\, t_i \tag{2}$$

where $W_i(X)$ is a weighting function, or kernel [8].
Observe that, for any given vector $X$, $\hat{m}(X)$ is a
weighted average of the execution time values, $t_i$, of the
$n$ previous observations. The weight function $W_i(X)$
typically assigns higher weights to observations close
to the parameter $X$, and lower weights to observations
farther away from $X$. This is illustrated in Figure 3 for
a scalar parameter $x = A$. In practice, many nonparametric
regression techniques only include in the average
points within some neighborhood of the parameter $X$,
making the estimate $\hat{m}(X)$ a local average of the observations
near the parameter vector $X$.
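
A weighted average of this form can be sketched for a scalar parameter as follows. The Gaussian weight function and the `bandwidth` parameter are assumptions chosen for illustration (the paper itself uses k-NN weights), and the names `kernel_weights` and `estimate_mean` are hypothetical.

```python
import math

def kernel_weights(x, xs, bandwidth):
    """Weights W_i(x), scaled so that (1/n) * sum(W_i) equals 1."""
    raw = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in xs]
    total = sum(raw)
    n = len(xs)
    return [n * r / total for r in raw]

def estimate_mean(x, xs, ts, bandwidth=1.0):
    """Weighted average (1/n) * sum(W_i(x) * t_i) of the observed times."""
    n = len(xs)
    weights = kernel_weights(x, xs, bandwidth)
    return sum(w * t for w, t in zip(weights, ts)) / n

flat = estimate_mean(2.0, [1.0, 2.0, 3.0], [5.0, 5.0, 5.0])
centered = estimate_mean(2.0, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
local = estimate_mean(1.0, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0], bandwidth=0.1)
```

With a narrow bandwidth the estimate is dominated by the observation closest to the query point, which is the local averaging behavior described above.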

A similar technique can be used to determine the
properties of $\epsilon$. As mentioned above, $\epsilon$ is a zero-mean
random variable, which can have an arbitrary probability
density function (pdf). This arbitrary nature of $\epsilon$
makes the estimation process difficult. To maintain simplicity,
only an estimate of the variance of $\epsilon$ will be
computed in this paper. Given an estimate $\hat{m}(X)$, $\hat{\sigma}^2(X)$
can be computed to be

$$\hat{\sigma}^2(X) = \frac{1}{n} \sum_{i=1}^{n} W_i(X)\, \bigl(t_i - \hat{m}(X)\bigr)^2 \tag{3}$$

Figure 3. Assigning Weights to Observations.

Figure 4. Effects of estimates at the boundary.

In this paper, the k-Nearest Neighbor (k-NN) algorithm
is used (although, in principle, any nonparametric
technique could be adapted). In k-NN smoothing,
the estimate $\hat{m}(X)$ is constructed from the $k$ observations
with parameter vectors closest to the parameter
vector $X$. With regard to the execution time estimation
problem, there are two primary advantages of k-NN
smoothing. First, since the estimate is always constructed
from an average of $k$ points, the method can
easily adapt to sparse or dense regions in the observations.
Second, the method can be implemented in a
computationally efficient manner. The computational
complexity of the method is $O(n(p + \log n))$, where $p$
is the dimensionality of the parameter vector, and $n$ is
the number of past observations.
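
A minimal sketch of k-NN smoothing is shown below. It assumes uniform weights over the $k$ nearest observations and Euclidean distance, and it omits the robustness and boundary corrections discussed next; the formal definition in Appendix A may differ. The name `knn_estimate` is hypothetical.

```python
import math

def knn_estimate(x, observations, k):
    """Estimate the mean and variance of the execution time at parameter vector x.

    observations: list of (t_i, X_i) pairs, where X_i is a sequence of p floats.
    """
    # n distance evaluations of cost O(p), then an O(n log n) sort.
    by_distance = sorted(observations, key=lambda obs: math.dist(x, obs[1]))
    nearest = [t for t, _ in by_distance[:k]]
    mean = sum(nearest) / len(nearest)
    variance = sum((t - mean) ** 2 for t in nearest) / len(nearest)
    return mean, variance

history = [(1.0, (0.0,)), (2.0, (1.0,)), (3.0, (2.0,)), (10.0, (9.0,))]
mean, variance = knn_estimate((1.0,), history, k=3)
```

The distant observation at $X = 9$ is ignored, so the estimate reflects only the neighborhood of the query point. Computing all distances and sorting matches the stated $O(n(p + \log n))$ cost.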

While the k-NN algorithm is conceptually simple, there are many
factors to consider when using it. For example,
an important issue in k-NN smoothing is the choice of
the value $k$. If too many observations are included in
the average, the bias $E\{\hat{m}(X) - m(X)\}$ will be too
large, and the details of $m(X)$ will be lost. On the other
hand, if too few observations are averaged, the variance
$E\{(\hat{m}(X) - m(X))^2\}$ of $\hat{m}(X)$ will be too large,
resulting in a curve which is “noisy.” Choosing $k$ to obtain
a balance between these two extremes is known as
the bias-variance tradeoff, and is present in all nonparametric
regression techniques. In general, $k$ should grow
with $n$ such that $k \to \infty$ and $k/n \to 0$ as
$n \to \infty$ [4].
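
One simple rule that satisfies both limits is to let $k$ grow like the square root of $n$. This is a common heuristic offered here as an assumption, not the rule used in this paper; the name `choose_k` is hypothetical.

```python
import math

def choose_k(n):
    """Pick k = ceil(sqrt(n)), so k grows without bound while k/n shrinks to 0."""
    return max(1, math.ceil(math.sqrt(n)))

ks = {n: choose_k(n) for n in (10, 100, 10000)}
ratios = {n: k / n for n, k in ks.items()}
```

As the number of observations increases, more points are averaged (lower variance) while the averaged neighborhood covers a smaller share of the data (lower bias).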


Another  factor  in  the  design  of  a  nonparametric  re¬ 
gression  algorithm  is  the  ability  to  tolerate  erroneous 
data  points  in  the  set  of  observations.  (This  type  of  es¬ 
timator  is  said  to  be  a  robust  estimator.)  These  points, 
called  outliers,  do  not  conform  to  the  model  described 
in equation 1. The technique used to make the estimator
robust is called L-Smoothing [8], where a fixed percentage
of the observations with the largest and smallest
values of $t_i$ are eliminated from the local average.
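
The trimming step can be sketched as follows. The 10% trim fraction and the name `trimmed_local_average` are assumptions for illustration; the paper does not state the percentage here.

```python
def trimmed_local_average(times, trim_fraction=0.1):
    """Average the local observations after dropping the extremes at both ends."""
    ordered = sorted(times)
    cut = int(len(ordered) * trim_fraction)
    cut = min(cut, (len(ordered) - 1) // 2)     # always keep at least one value
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

neighborhood = [1.0, 1.1, 0.9, 1.0, 1.0, 1.05, 0.95, 1.0, 1.0, 50.0]
plain = sum(neighborhood) / len(neighborhood)   # pulled upward by the outlier
robust = trimmed_local_average(neighborhood)    # outlier removed before averaging
```

A single erroneous measurement dominates the plain average but has no effect on the trimmed one.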

A  final  issue  encountered  when  using  nonparametric 
regression  techniques  is  the  behavior  of  the  estimates 
near  the  boundaries  of  the  set  of  observations  (i.e.,  no 
observations  lie  beyond  the  boundary).  As  the  param¬ 
eter  value  X  approaches  a  boundary,  the  local  aver¬ 
age  becomes  biased,  since  more  observation  points  will 
be on one side of point $X$ than the other. The one-dimensional
case is illustrated in Figure 4, where the
estimated function $\hat{m}(x)$ will become biased near the
boundary [6, 8]. To ensure accurate estimates near the
boundary, a nonparametric regression technique needs
to be able to compensate for this effect. A formal definition
of the k-NN algorithm, including solutions to these
issues, is presented in Appendix A.

3.2  Parameterizing Machine Performance

If  the  k-NN  algorithm  is  used  as  presented  above, 
a  separate  set  of  observations  must  be  maintained  for 
each  machine.  This  condition  is  caused  by  the  lack 
of  a  mechanism  to  translate  performance  differences  of 
the  machines  into  numeric  parameters  which  can  be  in¬ 
corporated  into  the  execution  time  model  presented  in 
equation  1.  There  are  two  compelling  reasons  why  it 
is  desirable  to  eliminate  this  restriction  and  to  form  a 
unified set of observations. First, separate sets make
the process of adding new machines (or applications)
to the network difficult, since a few initial observations
are required in each set for the algorithm to function effectively.
Thus, widespread benchmarking is required
to  obtain  such  an  initial  set  for  each  machine.  Second, 
a  starvation  problem  can  exist,  where  a  machine  with 
few  observations  will  tend  to  produce  poor  execution 
time  estimates.  If  these  poor  estimates  are  larger  than 
the  actual  execution  time,  it  is  unlikely  that  the  schedul¬ 
ing  algorithm  will  choose  to  execute  the  task  on  that 
machine.  Thus,  the  machine  will  not  get  any  new  ob¬ 
servations  from  which  estimates  could  be  improved. 

To  jointly  utilize  observed  execution  times  across  all 
machines,  a  method  is  needed  to  characterize  the  avail¬ 
able  machines  using  numeric  parameters  which  can 
then  be  included  in  the  parameter  vector.  This  needs  to 
be  accomplished  such  that  the  distance  between  any  two 
machines,  in  terms  of  their  parameterization,  is  a  rough 
indication  of  the  similarity  of  the  performance  of  those 
machines.  This  process  can  be  accomplished  through 
the use of analytic benchmarking [7].

Analytic  benchmarking  characterizes  the  perfor¬ 
mance  of  a  machine  using  a  series  of  benchmarks.  In 
theory,  each  of  these  benchmarks  should  correspond  to 
a  primitive  code  type;  code  types  form  a  basis  which 
can  exactly  characterize  the  performance  of  a  machine 
for  any  task.  Because  the  construction  of  an  ideal  set  of 
benchmarks  is  difficult  (if  not  impossible),  a  rigorous 
definition  of  primitive  code  types  is  avoided,  and  in¬ 
stead  it  is  assumed  that  a  reasonable  set  of  r  benchmarks 
is  available  to  approximate  the  performance  differences 
between machines. These benchmarks can be used to span $\mathbb{R}^r$, where each axis corresponds to the results of one of the benchmarks, either in terms of the time required to execute the benchmark or the rate at which the machine performs iterations of the benchmark. This space will be called the machine space. Thus, a machine i can be represented by a point $B_i = [b_i^1 \; b_i^2 \cdots b_i^r]$ in the machine space, where $b_i^j$ is the result for benchmark j on machine i. $B_i$ will be called the benchmark vector for machine i.

The  points  in  this  space  can  then  be  used  to  con¬ 
struct  an  augmented  parameter  vector,  which  can  be 
used  with  the  method  presented  in  Section  3.1.  Given 
the  r-dimensional  machine  space  defined  above,  and  a 
task with an execution time that is a function of a vector of parameters $X = [x^1 \; x^2 \cdots x^p]$, an augmented parameter vector in the unified parameter space can be constructed by concatenating the benchmark vector $B_i$ and the parameter vector X, creating an (r+p)-dimensional parameter vector. Thus, the execution time observation of the task on machine i is associated with the augmented parameter vector $Y = [b_i^1 \cdots b_i^r \; x^1 \; x^2 \cdots x^p]$.
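As a small illustration of the concatenation just described, the Python sketch below builds an augmented vector and measures distance in the unified space. The function names are hypothetical, and Euclidean distance is an assumption made for the example.

```python
import math

def augment(benchmark_vector, x):
    """Concatenate a machine's benchmark vector B_i with task parameters X."""
    return list(benchmark_vector) + list(x)

def distance(y1, y2):
    """Euclidean distance in the unified (r + p)-dimensional parameter space."""
    return math.dist(y1, y2)
```

For a fixed task parameter vector, the distance between two augmented vectors reduces to the distance between the two machines in the machine space, which is what lets observations from similar machines act as near neighbours.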

While it is desirable to use a large number of benchmarks to accurately characterize machine performance, the dimensionality of the resulting parameter vector is large. An increase in the dimensionality of the parameter vector increases the computational cost of the k-NN smoothing algorithm, and decreases the rate at which $\hat{m}(X)$ converges towards the true curve $m(X)$ (due to the requirement of maintaining the bias-variance tradeoff). Therefore, it is desirable to minimize the dimensionality of the benchmark vector $B_i$, while maximizing the amount of information it contains. Since
the  distance  relationship  between  the  points  contains 
the  information  on  the  relative  performance  differences 
between  the  machines,  this  goal  can  be  accomplished 
by  reducing  the  number  of  dimensions  in  the  machine 
space,  while  attempting  to  preserve  the  distance  rela¬ 
tionship  between  the  points.  Potter  and  Chiang  [17] 
present  an  algorithm  that  can  be  used  to  embed  the 
benchmark  vectors  Bi  in  an  s-dimensional  subspace, 
where  s  <  r.  The  details  of  this  algorithm  are  given 
in  Appendix  B. 

This  embedding  creates  a  new  machine  parameter 
space  of  smaller  dimension,  called  the  reduced  machine 
space.  This  space  is  combined  with  the  parameters 
characterizing  the  input  data  to  yield  a  unified  param¬ 
eter  space,  as  outlined  above. 

3.3  Summary of the Complete Algorithm

In this section, the k-NN regression technique of Section 3.1 is applied to the augmented parameter vector method described in Section 3.2, to create the completed execution time estimation algorithm. The algorithm begins with a set of n previous observations of the execution time $\{(t_i, Y_i)\}_{i=1}^{n}$, where $Y_i = [b_{j_i}^1 \cdots b_{j_i}^s \; x_i^1 \; x_i^2 \cdots x_i^p]$ and $j_i$ is the machine from which observation $Y_i$ was obtained. Given this set, a parameter vector $X = [x^1 \; x^2 \cdots x^p]$ describing the input data set, and the reduced machine space $\mathbb{R}^s$ containing a point $B_j = [b_j^1 \cdots b_j^s]$ for each machine j, pseudocode for this algorithm can be constructed as follows.

Execution Time Estimation Algorithm:
begin

    For each candidate machine j with
    benchmark vector $B_j = [b_j^1 \cdots b_j^s]$

    begin

        Compute $\hat{m}(Y_j)$ and $\hat{\sigma}^2(Y_j)$, where
        $Y_j = [b_j^1 \cdots b_j^s \; x^1 \; x^2 \cdots x^p]$.

    end

    Give estimates computed above to
    matching and scheduling algorithm.

    The algorithm will return a
    machine j chosen to execute the task.

    Execute task on machine j, and
    measure the execution time $t_{n+1}$.

    Add observation $(t_{n+1}, Y_{n+1})$ to the
    set of previous observations, where

    n = n + 1.

end
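The loop in the pseudocode above might look like the following Python sketch. Several parts are assumptions made only to keep the example self-contained: the estimator is an unweighted k-NN mean and variance rather than the kernel-weighted, trimmed version of Appendix A; the "matching and scheduling algorithm" is replaced by a stand-in that picks the smallest mean estimate; and `run_task` is a hypothetical callback that executes the task and returns the measured time.

```python
import math

def estimate(y, observations, k):
    """Unweighted k-NN mean and variance of execution time at augmented vector y.

    observations: list of (t_i, Y_i) pairs.
    """
    nearest = sorted(observations, key=lambda obs: math.dist(obs[1], y))[:k]
    times = [t for t, _ in nearest]
    mean = sum(times) / len(times)
    var = sum((t - mean) ** 2 for t in times) / len(times)
    return mean, var

def schedule_and_observe(x, machines, observations, run_task, k=3):
    """One pass of the estimation loop.

    x: task parameter vector (list); machines: dict name -> benchmark vector (list).
    """
    # Estimate the execution time on every candidate machine.
    estimates = {j: estimate(b + x, observations, k) for j, b in machines.items()}
    # Stand-in scheduler (assumption): choose the smallest mean estimate.
    chosen = min(estimates, key=lambda j: estimates[j][0])
    # Execute, measure, and append the new observation (n = n + 1).
    t_new = run_task(chosen, x)
    observations.append((t_new, machines[chosen] + x))
    return chosen, estimates[chosen]
```

Because the observation set is shared, a measurement taken on one machine immediately informs later estimates for machines that lie nearby in the machine space.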

It  can  be  seen  that  every  time  a  given  task  is  run  on  a 
machine  in  the  system,  a  new  observation  is  added  to 
the  set  of  previous  observations.  Thus,  the  quality  of 
the  predictions  improves  with  time. 

One  issue  which  has  not  been  addressed  is  the  source 
of  an  initial  set  of  observations.  Since  the  execution 
time  estimate  is  a  function  of  the  set  of  previous  obser¬ 
vations,  at  least  one  initial  observation  is  required  when 
a  new  task/application  is  introduced  into  the  system. 
Thus,  the  task  must  be  executed  on  a  few  selected  ma¬ 
chines  in  order  to  obtain  a  few  initial  estimates.  These 
values  are  easily  obtained  during  the  development,  test¬ 
ing,  and  debugging  of  the  application. 

4  Evaluation 

To evaluate the performance of the methods presented in this paper, two sets of experiments were performed using real data. A machine space was constructed using the 10 benchmarks from the Byte benchmark suite [1], which comprises a variety of integer and floating point benchmarks. These benchmarks were executed on 16 different machines running different flavors of UNIX. The results were normalized, giving all benchmarks equal
weight.  This  10-dimensional  space  was  then  reduced 
to  a  3-dimensional  space  using  the  algorithm  outlined 
in  Section  3.2.  The  normalized  result  of  this  embedding 
is  pictured  in  Figure  5.  In  this  16  machine  environment, 
experiments  were  performed  using  real  data  obtained 
from  the  Cholesky  decomposition  and  trial  division  al¬ 
gorithms  presented  in  Section  2. 

The  first  set  of  experiments  emulates  the  situation 
where  a  new  machine  is  added  to  the  network.  This 
experiment  compares  the  performance  of  the  execution 
time  estimation  algorithm  when  observations  can  and 
cannot  be  shared  between  machines.  In  this  experi¬ 
ment,  the  execution  time  of  the  Cholesky  decomposi¬ 
tion  algorithm  was  estimated  on  a  single  machine  for  5 
different  matrix  sizes.  The  number  of  observations  on 
the  machine  was  varied  between  1  and  50.  The  average 
prediction  error  is  compared  for  three  different  simula¬ 
tions.  The  results  of  these  three  simulations  are  given  in 
Figure 6, which shows how the average error in $\hat{m}(X)$
changes as the number of observations increases.

The  first  simulation  shows  the  performance  of  the 
method  when  observations  are  not  shared  between  ma¬ 
chines.  The  execution  time  was  computed  to  be  a 
function  of  a  scalar  parameter:  the  size  of  the  matrix. 
The second and third simulations show the performance
of the algorithm when it is able to use observations
on  other  machines,  taking  advantage  of  an  additional 
350  observations  uniformly  distributed  across  the  re¬ 
maining  15  machines.  In  the  second  simulation,  the  3- 
dimensional  reduced  machine  space  is  used.  Thus,  by 
using  the  size  of  the  input  data  set  and  the  3-dimensional 
embedding  as  parameters,  the  execution  time  was  a 
function  of  a  4-dimensional  parameter  vector.  The  third 
simulation  used  the  full  10-dimensional  machine  space 
in  the  multidimensional  algorithm.  In  this  case,  the  ex¬ 
ecution  time  was  a  function  of  an  11-dimensional  pa¬ 
rameter  vector.  In  all  of  these  simulations,  the  value 
of  k  used  to  make  an  estimate  of  the  execution  time  on 
a  given  machine  j  was  defined  to  be  f(0.1)(nj)^/^], 
where  nj  is  the  number  of  observations  for  machine  j, 
which  satisfies  the  bias-variance  tradeoff  requirements 
outlined  in  Section  3.1. 

As  shown  in  Figure  6,  the  ability  to  share  observa¬ 
tions  between  machines  gives  the  algorithms  used  in 
the  second  and  third  simulations  a  significant  perfor¬ 
mance  advantage  over  the  first  algorithm  when  there 
are  few  observations  from  which  to  compute  an  es¬ 
timate.  The  latter  simulations  produce  prediction  er¬ 
rors  around  50%,  versus  errors  around  500%  using  the 
first  method.  The  performance  difference  between  the 
two  observations-sharing  methods  is  negligible.  For 
larger numbers of observations, all three methods perform equally, with prediction errors around 15% using a few tens of observations. To compare the computational costs of these three algorithms, Figure 7 shows
the  measured  CPU  time  of  each  algorithm  as  a  function 
of  n  (the  size  of  the  entire  observations  set).  This  fig¬ 
ure  shows  that  using  a  reduced  parameter  space  is  con¬ 
siderably  more  efficient  than  using  the  full  parameter 
space,  making  the  reduced  parameter  space  approach 
the best choice when considering both accuracy and efficiency. The algorithm was implemented using MATLAB's scripting language, and the CPU time was measured on an HP B180L workstation. While the measured CPU times are small, it is likely that a more efficient implementation could result from a conventional programming language.

In  the  second  set  of  experiments,  estimates  were 
computed  using  the  trial  division  algorithm  on  a  single 
machine  (no  observation  sharing  was  done  in  this  ex¬ 
periment).  As  shown  in  Figure  2,  the  execution  time  of 
this  algorithm  is  very  loosely  correlated  to  the  parame¬ 
ter  vector  X.  Thus,  in  this  extreme  example,  the  error 
in  the  execution  time  estimates  will  always  be  large,  re¬ 
gardless  of  the  number  of  past  observations.  However, 
these  experiments  demonstrate  the  utility  of  estimating 
the  sample  variance  of  the  execution  time,  and  using 
this  value  to  bound  the  execution  time.  Two  different 
simulations were performed, where $\hat{m}(X)$ and $\hat{\sigma}^2(X)$ were computed for 50 evenly-spaced problem sizes using 20 and 200 observations, respectively. These results are presented in Figures 8 and 9, which show $\hat{m}(X)$ and $\hat{m}(X) + 3\hat{\sigma}$. It can be seen in the figures that $\hat{m}(X) + 3\hat{\sigma}$ does act as a reasonable upper bound on the execution time. Thus, the combination of both the estimated execution time and the uncertainty, $\hat{\sigma}$, could be used by an appropriate matching and scheduling algorithm to make good scheduling decisions, despite the fact that the parameter vector may not contain sufficient information to obtain an accurate execution time estimate. Furthermore, Figure 8 illustrates how the execution time estimate conforms to the set of past observations, where the estimated curves form a distinct "hump" around the two observations with large execution times.

Figure 5. 3-Dimensional Distance Embedding of Machine Space.

5  Conclusions 

This  paper  presents  a  statistical  execution  time  esti¬ 
mation  algorithm  for  use  in  a  heterogeneous  distributed 
computing  environment.  This  algorithm  treats  the  exe¬ 
cution  time  as  a  random  variable,  and  makes  predictions 
using  past  observations  of  the  execution  time.  These  es¬ 
timates  compensate  for  the  properties  of  the  input  data 
set  and  the  machine  type,  without  requiring  any  direct 
knowledge of the internal operation of the task or machine. The random model allows the algorithm to determine the probable execution time of the task, even in
situations  where  the  estimate  has  a  large  amount  of  un¬ 
certainty.  This  algorithm  is  unique  in  that  it  is  able  to 
use  observations  from  dissimilar  machines  when  mak¬ 
ing  predictions,  through  the  process  of  analytic  bench¬ 
marking.  This  ability  greatly  simplifies  the  process  of 
adding  new  machines  to  the  system.  Furthermore,  an 
algorithm  is  presented  which  can  be  used  to  reduce  the 
number  of  parameters  introduced  by  the  analytic  bench¬ 
marking  process.  As  shown  in  Figure  6,  experimental 
results  indicate  that  this  method  can  make  accurate  ex¬ 
ecution  time  estimates  over  a  wide  range  of  parameter 
values using a few dozen past observations.

A  Appendix:  k-NN  Regression 

The k-NN regression algorithm, and other statistical techniques shown in this section, are derived from the methods surveyed in the books by Härdle [8] and Eubank [6], unless noted otherwise. Given a parameter vector $X = [x^1 \; x^2 \cdots x^p]$ and a set of n previous observations $\{(t_i, X_i)\}_{i=1}^{n}$, the k-NN method can be formally defined as follows. Let

$$J_X = \{\, i \mid X_i \text{ is one of the } k \text{ nearest neighbors of } X \,\}. \quad (4)$$

The k-NN method uses the elements in $J_X$ to form a weighted average of observations, similar to equation 2.

Figure 6. Average Prediction Error vs. Number of Observations.

Figure 7. Computation Costs of Prediction Algorithm.

Figure 8. $\hat{m}(X)$ and $\hat{m}(X) + 3\hat{\sigma}$ with 20 observations.

Figure 9. $\hat{m}(X)$ and $\hat{m}(X) + 3\hat{\sigma}$ with 200 observations.

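L-Smoothing [8] is used to make the weighted average robust (i.e., able to tolerate outliers in the data set). A fixed percentage of the observations with the largest and smallest t values are not included in the local average. L-Smoothing can be implemented by sorting the observations $\{(X_i, t_i)\}$, $i \in J_X$, by $t_i$, then computing $\hat{m}(X)$ to be

$$\hat{m}(X) = \sum_{i=\lfloor \gamma k \rfloor}^{k - \lfloor \gamma k \rfloor} W_i(X)\, t_i \quad (5)$$

and $\hat{\sigma}^2(X)$ to be

$$\hat{\sigma}^2(X) = \sum_{i=\lfloor \gamma k \rfloor}^{k - \lfloor \gamma k \rfloor} W_i(X)\,\bigl(t_i - \hat{m}(X)\bigr)^2. \quad (6)$$

The value of $\gamma$, where $0 < \gamma < 0.5$, controls the percentage of observations excluded from the average.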

Next, the weight $W_i(X)$ assigned to each observation will be defined. First, consider a weighting function (also called a kernel function) for a single dimension j. It is desirable for this weighting function to give higher weights to the observations closer to the parameter value $x^j$. A weighting function which satisfies this condition is the Epanechnikov Kernel $K(u)$ [5, 6], where

$$K(u) = \frac{3}{4}(1 - u^2) \quad (7)$$

and $|u| \le 1$. This function is symmetric, with a maximum at $u = 0$, and can be shifted, scaled and normalized such that the point $u = 0$ corresponds to the parameter value $x^j$. In this way, the function gives higher weights to observations near the parameter value, and lower weights to more distant observations. To formally present this concept, the scaled Epanechnikov kernel, for dimension j, is defined to be

$$K_R^j(u) = \frac{1}{R^j} K\!\left(\frac{u}{R^j}\right). \quad (8)$$

The Epanechnikov kernel is scaled by the factor $R^j$ which, for the k points in $J_X$, is defined to be

$$R^j = \max_{i \in J_X} |x^j - x_i^j|. \quad (9)$$

Given these definitions, a scaled kernel function $K_R^j$ can be computed for each dimension j, where $1 \le j \le p$. Each of these functions can then be combined into a single multidimensional kernel function $K_R(U)$, where $U = [u^1 \cdots u^p]$ and

$$K_R(U) = \prod_{j=1}^{p} K_R^j(u^j). \quad (10)$$

This factor ensures that the sum of the weights will be one.

A modified kernel function is used for parameter values near the boundary, in order to compensate for boundary effects. This method, first presented by Rice [19], parameterizes the kernel function, which eliminates the bias near the boundaries, and yields variance near the boundaries that is the same order of magnitude as for points in the interior [19].

To define this method, first assume observations in dimension j are bounded to the interval $[a^j, b^j]$. Now, after computing the set $J_X$, if an observation $X_i = [x_i^1 \cdots x_i^p]$ with $x_i^j < a^j$ or $x_i^j > b^j$ is in the set $J_X$, a boundary kernel will need to be used for dimension j in equation 10. Otherwise, the regular kernel function is used. To define the boundary kernel, first define the parameter


Kj^iu)  -  (1  -  P^)K^{u)  +  (17) 

The  function  can  be  directly  substituted  for  the 
function  defined  in  equation  8. 
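The interior (non-boundary) kernel weighting of this appendix can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: the per-dimension scale is taken as the largest absolute coordinate difference among the neighbours, the weights are normalized by dividing by their sum, a uniform fallback is used when every raw weight is zero, and the boundary kernel is omitted. All function names are hypothetical.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 3/4 (1 - u^2) on |u| <= 1, zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kernel_weights(x, neighbors):
    """Normalized product-kernel weights for the neighbours of x.

    x: parameter vector; neighbors: list of parameter vectors of equal length.
    """
    p = len(x)
    # Per-dimension scale: largest distance to a neighbour (1.0 if all coincide).
    scale = [max(abs(x[j] - xi[j]) for xi in neighbors) or 1.0 for j in range(p)]
    raw = []
    for xi in neighbors:
        w = 1.0
        for j in range(p):
            # Product over dimensions of the scaled one-dimensional kernel.
            w *= epanechnikov((x[j] - xi[j]) / scale[j])
        raw.append(w)
    total = sum(raw)
    if total == 0.0:
        # Fallback (assumption): uniform weights when all raw weights vanish.
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]
```

With this choice of scale the farthest neighbour in any dimension receives zero weight, and any constant factor in the scaled kernel cancels in the normalization.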
B  Appendix:  Embedding  Points  in  an  s-dimensional  Space

Potter and Chiang [17] present an algorithm to embed points in a lower dimensional space in a manner which attempts to preserve the distance relationship between the points. This method begins with the r-dimensional machine space defined above, which contains m points $B_1, \ldots, B_m \in \mathbb{R}^r$ representing the available machines. As mentioned above, the distance relationship between these m points will be represented using a Euclidean distance matrix $D_r = \{d_{ij}\} \in \mathbb{R}^{m \times m}$ (a real $m \times m$ matrix) in $\mathbb{R}^r$, where

$$d_{ij} = -\frac{1}{2}\|B_i - B_j\|^2 \quad (18)$$

and $\|Z\| = \left(\sum_{l=1}^{r} z_l^2\right)^{1/2}$.

The goal of this algorithm is to find a set of m points in $\mathbb{R}^s$, where $s < r$, with distance matrix $D_s$ providing the best-fit to $D_r$. The algorithm operates as follows.

1. Compute the Euclidean distance matrix $D_r$ from the m points in $\mathbb{R}^r$.

2. Compute the orthogonal projection matrix P, which is defined to be

$$P = I - \frac{1}{m} e e^T, \quad (19)$$

where $e = [1 \; 1 \cdots 1]$ is a vector in $\mathbb{R}^m$, and I is an $m \times m$ identity matrix.

3. Construct the matrix

$$A = P D_r P^T. \quad (20)$$

4. Diagonalize A with an orthogonal matrix U and a diagonal matrix $\Lambda$, such that

$$A = U \Lambda U^T. \quad (21)$$

5. Form $\hat{\Lambda}$ by retaining the s largest eigenvalues in $\Lambda$ and setting the rest to zero.

6. Compute the matrix

$$C = U \hat{\Lambda}^{1/2}. \quad (22)$$

The rows of the matrix C will give the coordinates of the m points in $\mathbb{R}^s$ which have a distance relationship closest to that of the original points [17].
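The six steps above can be sketched in pure Python as follows. This is a hedged illustration rather than the algorithm of [17] verbatim: it assumes the distance matrix holds $-\frac{1}{2}$ times the squared Euclidean distances, it replaces the full diagonalization of step 4 with power iteration plus deflation (a swapped-in technique that extracts only the s largest eigenpairs), and it returns only the s nonzero columns of C. The function name `embed`, the iteration count, and the fixed start vector are hypothetical choices.

```python
import math

def embed(points, s, iters=500):
    """Embed m points into s dimensions, approximately preserving distances."""
    m = len(points)
    # Step 1: d_ij = -1/2 * squared Euclidean distance (assumed convention).
    D = [[-0.5 * sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
         for p in points]
    # Steps 2-3: A = P D P^T with P = I - (1/m) e e^T, i.e. double centering.
    row_mean = [sum(row) / m for row in D]
    all_mean = sum(row_mean) / m
    A = [[D[i][j] - row_mean[i] - row_mean[j] + all_mean for j in range(m)]
         for i in range(m)]
    # Steps 4-6: extract the s largest eigenpairs one at a time by power
    # iteration with deflation; column c of C is sqrt(lambda_c) * u_c.
    C = [[0.0] * s for _ in range(m)]
    for c in range(s):
        v = [math.sin(i + 1.0 + c) for i in range(m)]  # fixed start vector
        lam = 0.0
        for _ in range(iters):
            w = [sum(A[i][j] * v[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(val * val for val in w))
            if norm < 1e-12:  # no remaining nonzero eigenvalue to extract
                lam = 0.0
                break
            v = [val / norm for val in w]
            lam = norm
        for i in range(m):
            C[i][c] = math.sqrt(lam) * v[i]
        # Deflate: remove the extracted component before the next pass.
        for i in range(m):
            for j in range(m):
                A[i][j] -= lam * v[i] * v[j]
    return C
```

When the original points already lie in an s-dimensional affine subspace, the embedded points reproduce the pairwise distances up to numerical error; otherwise the result is a best-fit approximation.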
Abstract

This  paper  describes  a  simulation  tool  for  the  analysis 
of  complex  jobs  described  in  the  form  of  task  graphs.  The 
simulation  procedure  relies  on  the  PN-based  topological 
representation  of  the  task  graph  that  takes  advantage  of 
directly  modeling  precedence  constraints  and  other  char¬ 
acteristics  inherent  in  Generalized  Stochastic  Petri  Nets 
(GSPN).  The  GSPN  representation  is  enhanced  with 
enabling  functions  that  govern  the  sequence  of  firings  of 
transitions  representing  execution  of  tasks.  The  regulated 
flow  of  activity  is  carried  out  observing  not  only  prece¬ 
dence  constraints  but  specific  allocation  heuristics  and 
communication  delays.  The  tool  is  useful  in  evaluating 
different  heuristics  described  by  the  corresponding  imple¬ 
mented  algorithm,  or  using  a  deterministic  timespan  given 
by  a  Gantt  chart. 

1.  Introduction 

Task  graphs  represent  general  computation  jobs  which 
have  been  decomposed  into  modules  called  tasks  that  are 
executed  according  to  some  precedence  constraints.  Task 
graphs  are  a  well  known  tool  to  study  performance  issues 
of  complex  jobs.  A  direct  solution  technique  for  series- 
parallel  task  graphs  is  reported  in  [1];  an  average  comple¬ 
tion  time  of  the  overall  job  is  derived  assuming  no  restric¬ 
tions  exist  on  the  number  and  architecture  of  processing 
units  and  with  no  regard  to  allocation  schemes.  Execution 
times  of  fork-join  parallel  programs  in  multiprocessor 
environments are discussed in [2]. An approach based on
multiplication/convolution  is  applied  to  Heterogeneous 
Computing  Systems  (HCS)  at  coarse  and  fine  levels  of 
granularity  in  [3].  Also,  in  [4]  performance  prediction  of 
fork-join  task  graphs  is  addressed,  where  the  residence 
times  of  each  task  are  estimated  in  terms  of  service 
demands and queuing delays; based on these estimations,
the task graph is then systematically reduced.

Markov-based  solutions  of  task  graph  systems  have 
been  reported  in  [5]  and  [6];  although  limited  to  relatively 
small  task  graphs,  a  Markov-based  solution  is  used  for  the 
analysis  of  scheduling  policies  in  [6].  Since  Stochastic 
Petri Nets (SPN) provide a natural representation of parallelism
and synchronization, their use spans applications
from  individual  parallel  and  concurrent  programs  to  dis¬ 
tributed  applications  and  multi-processor  systems  [7,  8, 
9].  SPN’s  can  be  used  to  directly  capture  the  topological 
information  of  a  task  graph  and  provide  a  systematic  way 
for  applying  factors  such  as  processor  heterogeneity,  allo¬ 
cation  schemes,  communication  costs,  and  random  execu¬ 
tion  times.  Also,  a  SPN-based  solution  can  be  applied  to 
arbitrary  graphs  which  are  acyclic  but  not  necessarily 
series-parallel  [10].  SPN-based  tools  automatically  gener¬ 
ate  Markov  models  that  represent  the  execution  process  of 
complex  task  graphs  where  each  state  is  given  by  the 
number  of  tasks  executing  in  parallel.  These  models  are 
then  solved  to  compute  system  performance  characteris¬ 
tics  such  as  a  distribution  of  the  overall  completion  time. 

When  the  job  represented  by  a  task  graph  is  executed 
on  the  processing  elements  of  a  HCS,  estimating  the  over¬ 
all  completion  time  becomes  an  optimization  problem 
involving  the  mapping  of  tasks  to  processors  such  that 
completion  time  is  minimized.  Mapping  tasks  to  process¬ 
ing  units  is  a  hard  problem  and  several  heuristics  have 
been  proposed  in  the  literature.  However,  before  choosing 
the  most  effective  heuristic  a  method  must  be  available  for 
computing  an  expected  completion  time  and  deriving 
execution  distributions  for  any  given  task  graph,  HCS,  and 
allocation  heuristic.  The  methodology  reported  in  [10]  to 
solve  complex  task  graphs  using  SPN’s  is  not  in  itself  an 
optimization  technique,  but  it  can  be  used  in  conjunction 
with  optimization  techniques  which  attempt  to  search  a 
space  of  completion  time  distributions.  However,  Markov- 
based numerical solutions are limited to exponential distributions
and often involve a large state space.
Consequently, the solution process may be unstable and
subject  to  stiffness  problems  rendering  inaccurate  results. 
Discrete  event  simulation  can  use  the  framework  provided 
by  SPN’s  [11]  and  circumvent  the  limitations  encountered 
in  the  solution  of  Markov-based  models.  The  work 
reported  in  this  paper  uses  the  SPN-based  topological  rep¬ 
resentation  of  task  graph  systems  just  as  in  [10]  but 
applies  discrete-event  simulation  to  obtain  execution  time 
distributions  and  estimates  of  the  Mean  Time  to  Comple¬ 
tion  (MTTC)  of  the  jobs  represented.  Thus  a  common 
model  based  on  SPN’s  is  used  to  drive  a  discrete  event 
simulation  of  the  overall  job.  The  method  can  be  used  to 
analyze  and  compare  several  assignment  heuristics  given 
either  the  algorithm  or  a  Gantt  chart  of  specific  assign¬ 
ment  cases.  To  illustrate  the  use  of  the  tool  several  assign¬ 
ment  heuristics  are  evaluated  and  compared. 

The  next  section  of  the  paper  introduces  the  notation 
and  parameters  used.  Basic  concepts  on  Petri  nets  are 
introduced  in  section  three  and  their  application  to 
describe  task  graphs  is  given  in  section  four.  Section  five 
deals  with  the  simulation  methodology.  A  brief  discussion 
on  allocation  heuristics  is  given  in  section  six.  The  inser¬ 
tion  of  communication  delays  is  discussed  in  section  six. 
Simulation  algorithms  are  presented  in  section  seven. 
Lastly,  applications  of  the  tool  are  discussed. 

2.  Parameters  and  Notation 

Throughout  the  paper  the  following  notation  is  used  to 
describe  the  simulation  tool  and  related  issues. 

• a task graph G(T, E) where the vertex set $T = \{T_1, T_2, \ldots, T_k\}$ consists of k tasks which compose some overall job and the edge set E consists of ordered pairs from T which correspond to data or control dependencies. The topology of T is described in detail by the following:

- an in-degree vector $D = [d_1, d_2, \ldots, d_k]$ where $d_i$ is the number of tasks which must complete before $T_i$ may initiate execution.

- an out-degree vector $H = [h_1, h_2, \ldots, h_k]$ where $h_i$ is the number of tasks which are spawned after the completion of $T_i$.

- a task graph structure TG[i][j], $1 \le i \le k$, $1 \le j \le h_i$, where TG[i] is an array specifying the $h_i$ tasks which are spawned by the completion of $T_i$; thus, the ordered pair $(T_i, TG[i][j]) \in E$.

• a k×k matrix pkt[i, j], $1 \le i, j \le k$, where pkt[i, j] is the average number of data packets of standard size that is sent from $T_i$ to $T_j$. Alternatively, these can be specified as edge weights for the elements of E.

• a priority vector $W = [w_1, w_2, \ldots, w_k]$ which induces a sequential ordering of any ready tasks assigned to the same processor; these priorities may be taken from the indices of the tasks, e.g. $w_i = k - i$, or they may be determined randomly or according to the assignment heuristic employed.

• a set $P = \{P_1, P_2, \ldots, P_n\}$ consisting of n processors composing a heterogeneous suite.

• a k×n execution time matrix B[i, j], $1 \le i \le k$, $1 \le j \le n$, where $b_{ij}$ is the average execution time of $T_i$ on $P_j$.

• an n×n communication time matrix C[r, s], $1 \le r, s \le n$, where each entry $c_{rs}$ is the average communication time to transfer a data packet of standard size from $P_r$ to $P_s$.

• a k×n static allocation matrix A[i, j], $1 \le i \le k$, $1 \le j \le n$, where entry $a_{ij} = 1$ if $T_i$ has been allocated to $P_j$, and 0 otherwise.
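To make the notation concrete, the Python sketch below derives the in-degree and out-degree vectors from the successor structure TG and computes a communication delay for one edge. It uses 0-based indices, represents the allocation as a list mapping each task to a processor index instead of the 0/1 matrix A, and assumes the delay of an edge is the packet count multiplied by the per-packet time between the two processors; these choices and the function names are illustrative assumptions.

```python
def degree_vectors(tg):
    """Derive the in-degree vector D and out-degree vector H.

    tg[i] lists the tasks spawned by the completion of task i (0-based).
    """
    out_degree = [len(successors) for successors in tg]
    in_degree = [0] * len(tg)
    for successors in tg:
        for j in successors:
            in_degree[j] += 1
    return in_degree, out_degree

def transfer_time(i, j, pkt, comm, alloc):
    """Assumed delay for edge (T_i, T_j): packets sent times the per-packet
    time between the processors the two tasks are allocated to."""
    return pkt[i][j] * comm[alloc[i]][alloc[j]]
```

A task with in-degree zero is ready at the start of the job, and a delay of zero results whenever both endpoints of an edge share a processor whose diagonal entry in the communication matrix is zero.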

3.  Basic  Petri  Net  Concepts 

A Petri net (PN) is a directed, weighted, bipartite graph [12]. PNs are
bipartite in that nodes are of two types, places and transitions, with
arcs occurring either from places to transitions or from transitions to
places. When an arc is from a place p to a transition t, then p is an
input place of t; a place p is an output place of t if an arc proceeds
from t to p. Places and transitions are represented pictorially by
circles and thin rectangles, respectively. A third component of any PN
is its tokens, which reside in places; pictorially, tokens are
represented by dots within the perimeters of places. Tokens are
transferred from one place to another by the firing of transitions.
When a transition t fires, tokens are removed from all input places of
t and placed in the output places of t, thus enforcing a logical flow
of activity throughout the net. A transition can fire if it is enabled,
i.e., if all of its input places possess at least one token. An arc may
be weighted, where the weight specifies the number of tokens which must
reside in an input place in order for a transition to be enabled, or
the number of tokens placed in an output place by the firing of an
enabled transition; if the weight is unspecified then it is assumed to
be one. PNs and their dynamic behavior can be captured in mathematical
notation via state vectors. Given a PN with k places, a marking q of
the PN is denoted by $M_q$; a marking is described by a k-vector whose
ith component denotes the number of tokens in place $p_i$; an initial
marking of the PN is denoted by $M_0$. A particular PN with an
underlying graph N is denoted $(N, M_0)$. The reachability graph of a
PN is a graph $G = (M, A)$ where the vertex set M is the set of all
reachable markings of the PN and the edge set A consists of all
possible transition firings transforming one marking into another.
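
The enabling and firing rules described above can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the names `pre`, `post`, `enabled`, and `fire` are assumptions introduced here, with `pre[t][p]` and `post[t][p]` holding the input and output arc weights, and the small net at the end is hypothetical.

```python
def enabled(marking, pre, t):
    """A transition is enabled if every input place holds at least the arc weight."""
    return all(marking[p] >= w for p, w in pre[t].items())


def fire(marking, pre, post, t):
    """Return the marking reached by firing t (tokens consumed, then produced)."""
    if not enabled(marking, pre, t):
        raise ValueError(f"transition {t} is not enabled")
    new = dict(marking)
    for p, w in pre[t].items():
        new[p] -= w
    for p, w in post[t].items():
        new[p] = new.get(p, 0) + w
    return new


# Hypothetical net: t0 moves one token from p0 to p1 (weight 1);
# t1 needs two tokens in p1 and puts one token in p2.
pre = {"t0": {"p0": 1}, "t1": {"p1": 2}}
post = {"t0": {"p1": 1}, "t1": {"p2": 1}}
m0 = {"p0": 2, "p1": 0, "p2": 0}   # initial marking
m1 = fire(m0, pre, post, "t0")
m2 = fire(m1, pre, post, "t0")
m3 = fire(m2, pre, post, "t1")
```

Each call to `fire` corresponds to one edge of the reachability graph; here `t1` only becomes enabled after `t0` has fired twice, because its input arc carries weight two.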


Stochastic Petri nets (SPNs) are PNs in which there is an exponentially
distributed delay time between the enabling and firing of transitions.
The reachability graph of a bounded SPN is isomorphic to a finite
Markov chain (MC) [13]; in particular, the markings of the reachability
graph comprise the state space of an MC, and the transition rate
between any two states $X_i$ and $X_j$ is the sum of the firing rates
of all transitions transforming $M_i$ into $M_j$. Generalized
stochastic Petri nets (GSPNs) have been proposed [14] in which
transitions are of two types: timed transitions, which have
exponentially distributed firing delays, and immediate transitions,
which have no firing delay and have priority over any timed transition.
Enabling functions are marking-dependent functions which can be defined
on each transition as a switching mechanism. Transition priorities
(timed vs. immediate) and enabling functions are logically equivalent
extensions of SPNs which endow them with the full computational power
of Turing machines [15]. In this paper the notion of GSPN is used.

4.  GSPN  Models  of  Task  Graphs 

Task graphs are assumed to be series-parallel in several approaches to
performance evaluation [16] and optimization [17]; however, this
limitation is avoided in the PN-based methodology of this work. Fig. 1
shows a simple task graph which will be used to illustrate the
translation of task graphs into GSPNs. The translation of a task graph
into a GSPN begins with the association of each task $T_i$ with a
place/timed transition pair, $p_i$ and $t_i$. Fig. 2 shows the GSPN
corresponding to the task graph in Fig. 1. Auxiliary places $xp_0$ and
$xp_1$ and immediate transitions $it_0$ and $it_1$ are used to enforce
initiation and completion conditions, respectively, for the overall
job. The presence of at least one token in a place may represent the
fulfillment of all preconditions for the initiation of the task. The
firing of a timed transition represents the completion of execution of
the corresponding task. The delay time of each transition corresponds
to the exponentially distributed execution time of the task. A place
$p_i$ can be associated with the in-degree $d_i$ to enforce precedence
constraints. Initially, the presence of a token in $xp_0$ enables
$it_0$; the firing of $it_0$ represents the initiation of an execution
cycle. The presence of three tokens in $xp_1$ and the firing of $it_1$
indicate that an execution cycle has been completed. Timed transitions
in the GSPN model in Fig. 2 will fire once enabled. Beginning with an
initial marking $M_0$, a sequence of markings can be generated to form
a reachability graph. The set of markings generated corresponds to the
possible execution states of the system, where a system state is
defined by the tasks which are executing concurrently. If firing times
are exponentially distributed, the set of markings generated
corresponds to a Markov chain that can be solved using well-known tools
such as SPNP [18] or SHARPE [1].

Consider some marking $M_i$ in which a task with two predecessors, one
of them $T_2$, should be ready to run. To make this possible, both
predecessors must have finished execution; this is indicated by the
presence of two tokens in the task's place. To capture this
precedence  constraint  it  suffices  to  associate  each  input 
arc  into  a  timed  transition  with  a  weight  corresponding  to 
the  in-degree  of  each  node  in  the  task  graph.  Alterna¬ 
tively,  the  in-degree  vector  is  associated  with  marking- 
dependent  enabling  functions. 
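
The translation just described can be sketched as a small Python function. This is a hedged illustration under stated assumptions, not the authors' tool: the function name `task_graph_to_gspn`, the successor-list input `succ`, and the string labels (`p0`, `t0`, `xp0`, `it0`, ...) are all introduced here, and the output arcs of `it1` are left empty because the text does not say where its tokens go.

```python
def task_graph_to_gspn(succ):
    """Build input/output arc weights of a GSPN from task-graph successor lists."""
    k = len(succ)
    indeg = [0] * k
    for i in range(k):
        for h in succ[i]:
            indeg[h] += 1
    sources = [i for i in range(k) if indeg[i] == 0]
    sinks = [i for i in range(k) if not succ[i]]
    # it0 starts a cycle by marking every source place; it1 ends the cycle
    # once every sink task has deposited a token in xp1.
    pre = {"it0": {"xp0": 1}, "it1": {"xp1": len(sinks)}}
    post = {"it0": {f"p{i}": 1 for i in sources}, "it1": {}}
    for i in range(k):
        # Input arc weight equals the in-degree; a source place instead
        # waits for the single token produced by it0.
        pre[f"t{i}"] = {f"p{i}": max(indeg[i], 1)}
        if succ[i]:
            post[f"t{i}"] = {f"p{h}": 1 for h in succ[i]}
        else:
            post[f"t{i}"] = {"xp1": 1}
    return pre, post


# Hypothetical diamond-shaped task graph: T0 -> {T1, T2} -> T3.
pre, post = task_graph_to_gspn([[1, 2], [3], [3], []])
```

In this example the input arc of `t3` gets weight two, so `t3` is enabled only after both `T1` and `T2` have completed, which is the precedence constraint discussed above.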

Figure 2. GSPN model of the task graph from Fig. 1


5.  Simulation  Methodology 

In an SPN model, the firing of transitions represents the occurrence of
events, in this case the execution of tasks. To simulate the execution
of tasks [11], a clock is set for each newly enabled transition to keep
track of the execution time until the transition fires. The simulation
procedure must also check for precedence constraints, availability of
processors, and priority of tasks. When a transition is enabled, its
firing time is generated as a random variate from a selected
distribution. Firing times are recorded by associating clocks with
transitions. The PN-based simulation procedure consists of the
following major steps:

1) Check for newly enabled transitions,

2) Generate firing times, and

3) Update clocks.

5.1.  Enabling  Functions 

A transition $t_i$ is enabled when conditions $Q_i$, $V_i$, and $Z_i$
are satisfied. Each entry of the enabling vector $F = [f_i]$,
$0 \le i < k$, is evaluated as

$$f_i = Q_i V_i Z_i$$

If $f_i$ evaluates to one, $t_i$ can fire.


Condition $Q_i$ checks for precedence constraints; that is, when the
number of tokens $m_i$ in place $p_i$ is equal to the in-degree $d_i$
of the vertex representing task $T_i$, its precedence constraints are
met, i.e.,

$$Q_i = \begin{cases} 1 & \text{if } m_i = d_i \\ 0 & \text{otherwise} \end{cases}$$

Condition $V_i$ checks for allocation and availability of processors.
To check for allocation it suffices to examine the ith row of matrix A
for $a_{ij} = 1$ and then verify whether processor j is free. Let a
binary vector $FREE = [free_j]$, $0 \le j \le n-1$, keep track of which
processors are currently free; then

$$V_i = a_{ij} \, free_j$$

If more than one transition satisfies conditions Q and V and their
corresponding tasks are allocated to the same processor, only one
transition should be enabled (only one task should execute) even though
these tasks could otherwise execute in parallel. The task with the
highest priority is chosen using the priority vector W. Let Rdy be the
set of transitions representing parallel tasks allocated to the same
processor, that is, the set of transitions that could be enabled from a
current marking $M_i$; then

$$Z_i = \begin{cases} 1 & \text{if } w_i = \max_{t_j \in Rdy} \{w_j\} \\ 0 & \text{otherwise} \end{cases}$$

Note that these functions could also be implemented by incorporating
additional places and transitions into the model in Fig. 2. For
example, the presence of a token in a dedicated place can be used to
model the availability of a processor and to derive statistical
measurements on the usage of that processor [19]. Additional immediate
transitions can likewise be used to model task priorities. It can be
argued that additional modeling elements may obscure the representation
of a task graph and that, although they are useful, they become
transparent to the user when dealing with large complex models. We find
the addition of places to model processing elements and their
interconnections useful when analyzing the behavior of systems running
several jobs modeled by different task graphs, or several instances of
the same job, in an effort to capture the load of the system, resource
contention, and usage. In our case the effect of external load is
reflected in the execution time of each subtask. The use of enabling
functions keeps both the model and the simulation code relatively
simple.
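
A minimal sketch of how the enabling vector could be evaluated is given below. It is not the paper's implementation: the function name `enabling_vector` and its argument layout are assumptions made for illustration (`m` and `d` are token counts and in-degrees per task, `A` is the allocation matrix, `free` the processor-availability vector, `w` the priority vector), and the example data are hypothetical.

```python
def enabling_vector(m, d, A, free, w):
    """Return F = [f_i] with f_i = Q_i * V_i * Z_i for every transition t_i."""
    k, n = len(A), len(free)
    # Q_i: precedence constraints are met when the token count equals the in-degree.
    Q = [1 if m[i] == d[i] else 0 for i in range(k)]
    # V_i: the processor that T_i is allocated to is currently free.
    V = [1 if any(A[i][j] and free[j] for j in range(n)) else 0 for i in range(k)]
    F = [0] * k
    for j in range(n):
        # Rdy: transitions satisfying Q and V whose tasks share processor j.
        rdy = [i for i in range(k) if A[i][j] and Q[i] and V[i]]
        if rdy:
            # Z_i: only the highest-priority candidate on this processor is enabled.
            F[max(rdy, key=lambda i: w[i])] = 1
    return F


# Hypothetical case: T0 and T1 compete for processor 0, T2 still waits for tokens.
A = [[1, 0], [1, 0], [0, 1]]
F = enabling_vector(m=[1, 1, 0], d=[1, 1, 2], A=A, free=[1, 1], w=[3, 5, 1])
```

Here `T1` wins processor 0 because its priority (5) exceeds that of `T0` (3), while `T2` stays disabled since it holds fewer tokens than its in-degree.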

5.2.  Firing  times 

If a timed transition is enabled, a firing time is generated using a
firing rate given in terms of the average execution time of each task
obtained from matrix B. Random variates are generated from three
possible distributions: exponential, normal, and uniform. The values
given by matrix B are used according to the distribution function
selected. The uniform and normal distributions require a second value
that must be provided by the user. If B1 denotes the first matrix,
given as the execution matrix B, then B2 denotes a second matrix
provided by the user for the case of normal and uniform distributions.
For exponential and normal distributions B1 provides the average
execution time. For normal distributions the matrix B2 provides the
standard deviation $\sigma_{ij}$. In the case of a uniform
distribution, matrix B1 provides the starting point $b1_{ij}$ and
matrix B2 provides the ending point $b2_{ij}$. These values are used to
calculate the mean as $(b1_{ij} + b2_{ij})/2$. A pseudo-random number u
is generated from U(0, 1). A firing time $x_{ij}$ associated with
transition $t_i$ is generated for each distribution as follows:

i). Exponential distribution, $\exp(b1_{ij})$:

$$x_{ij} = -b1_{ij} \times \ln(u)$$

ii). Normal distribution, $N(b1_{ij}, b2_{ij})$:

$$x_{ij} = \left( \sum_{a=1}^{12} u_a - 6 \right) \times b2_{ij} + b1_{ij}$$

iii). Uniform distribution, $U(b1_{ij}, b2_{ij})$:

$$x_{ij} = u \times (b2_{ij} - b1_{ij}) + b1_{ij}$$
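
The three generators above can be sketched directly in Python. This is an illustrative sketch only: the function name `firing_time` and its parameters are assumptions, `b1` and `b2` stand for the entries $b1_{ij}$ and $b2_{ij}$, and the normal case uses the sum-of-twelve-uniforms approximation written in item ii).

```python
import math
import random


def firing_time(dist, b1, b2=None, rng=random):
    """Draw one firing time from the selected distribution."""
    if dist == "exponential":
        u = 1.0 - rng.random()           # u lies in (0, 1], so ln(u) is defined
        return -b1 * math.log(u)         # exponential variate with mean b1
    if dist == "normal":
        s = sum(rng.random() for _ in range(12))
        return (s - 6.0) * b2 + b1       # approximately N(b1, b2^2); can be negative
    if dist == "uniform":
        return rng.random() * (b2 - b1) + b1
    raise ValueError(f"unknown distribution: {dist}")


rng = random.Random(0)                   # fixed seed, for reproducibility only
samples = [firing_time("exponential", 2.0, rng=rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
```

Note that the normal approximation can return a negative value when the standard deviation is large relative to the mean; how such draws should be handled is not stated in the text, so the sketch leaves them as they are.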

5.3.  Clock  Update 

Local clocks keep track of firing times, and a global clock records the
overall completion time. When a timed transition is enabled, a local
clock is set to the generated firing time to indicate the remaining
time until the transition fires. The global clock is denoted by C and
local clocks are represented by a vector $LC = [lc_i]$,
$0 \le i \le k-1$, where $lc_i$ is the local clock associated with
transition $t_i$.

Since local clocks indicate remaining times, they are discarded when
they reach 0 time units and the corresponding transitions fire. At the
moment a transition fires, the global clock and local clocks are
updated. The global clock update is performed by adding the minimum
local clock time min_t to the global clock C; min_t is taken from the
set of enabled transitions that have not yet fired. The following
expressions are used to update all clocks:

$$C = C + min\_t$$
$$lc_i = lc_i - min\_t$$

where $min\_t = \min_i \{lc_i\}$, $0 \le i \le k-1$. Once the last
transition fires, the global clock C indicates the overall completion
time.
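
The clock update can be illustrated with a short Python sketch. The function name `advance_clocks` and the use of a dictionary keyed by transition index are assumptions made for this example, not details taken from the paper.

```python
def advance_clocks(C, lc):
    """Advance time to the next firing; lc maps enabled transitions to remaining time."""
    min_t = min(lc.values())
    fired = [i for i, t in lc.items() if t == min_t]
    C += min_t
    # Subtract min_t from every local clock and discard the clocks that reached zero.
    lc = {i: t - min_t for i, t in lc.items() if i not in fired}
    return C, lc, fired


# Hypothetical state: global clock at 10.0, three enabled transitions.
C, lc, fired = advance_clocks(10.0, {0: 2.5, 3: 1.0, 4: 4.0})
```

After the call, transition 3 has fired, the global clock has advanced by its remaining time, and the other two clocks have been reduced by the same amount.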

6.  Heuristics 

Different  allocation  heuristics  can  be  evaluated  by 
mapping  them  into  the  allocation  matrix  A.  To  illustrate 
the  use  of  the  simulation  methodology  discussed  in  this 
paper  four  static  allocation  heuristics  are  evaluated  and 
compared. 

1. Shortest Estimated Execution Time First (SEETF).
In this scheme [20, 21, 22] a task $T_i$ is selected at random
from the task set and assigned to the processor that executes
$T_i$ fastest. The elements of the task allocation matrix for
the SEETF algorithm are determined as follows:

$$a_{ij} = \begin{cases} 1 & \text{if } b_{ij} = \min_j \{b_{ij}\} \\ 0 & \text{otherwise} \end{cases}$$

2. Minimum Finish Time (MFT). In this allocation scheme [22],
task $T_i$ is also selected randomly, but from a topologically
sorted task set, i.e., taking into account the precedence
constraints between tasks. The selected processor is the one
that minimizes the finish time of the task in a deterministic
simulated execution, where the finish time of a selected task
$T_i$ is given by the minimum sum of its execution time and
the next time instant at which processor $P_j$ becomes free:

$$a_{ij} = \begin{cases} 1 & \text{if } b_{ij} + \text{time until } P_j \text{ is free} = \min_j \{b_{ij} + \text{time until } P_j \text{ is free}\} \\ 0 & \text{otherwise} \end{cases}$$

Note that all tasks are selected randomly, but restricted to
those tasks whose predecessors have already been allocated.


3. Largest Task First (LTF) [22]. The selection of tasks is
based on service demands. The task with the largest service
demand is selected first, or alternatively the task with the
largest execution time is selected first. Thus:

$$a_{ij} = \begin{cases} 1 & \text{if } b_{ij} = \max_i \{b_{ij}\} \\ 0 & \text{otherwise} \end{cases}$$

A processor $P_j$ is selected randomly.

4. Most Data Task First (MDTF). This scheme selects the task
that generates the most data. The data generated by a task
$T_i$ is determined in terms of the number of data packets
going out, that is:

$$pkt_i = \sum_{j=1}^{k} pkt_{ij}$$

Thus, the construction of the allocation matrix proceeds as
follows:

$$a_{ij} = \begin{cases} 1 & \text{if } pkt_i = \max_l \{pkt_l\} \\ 0 & \text{otherwise} \end{cases}$$

and in this case also the processor $P_j$ is selected randomly.
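
As a hedged illustration of how a heuristic is mapped into the allocation matrix A, the sketch below builds A for SEETF and for a simplified MFT. The function names `seetf` and `mft` are assumptions; B is taken as a tasks-by-processors matrix of execution times, ties go to the lowest processor index, the task order is passed in rather than drawn at random, and the MFT variant tracks only processor availability (it ignores start-time delays caused by predecessors), so it is a simplification of the scheme described above.

```python
def seetf(B):
    """SEETF: assign each task to the processor with the smallest b_ij."""
    A = []
    for row in B:
        j_best = min(range(len(row)), key=lambda j: row[j])
        A.append([1 if j == j_best else 0 for j in range(len(row))])
    return A


def mft(B, order):
    """Simplified MFT: minimize b_ij plus the time until P_j is free."""
    n = len(B[0])
    busy_until = [0.0] * n               # time at which each processor becomes free
    A = [[0] * n for _ in B]
    for i in order:                      # order must respect precedence constraints
        j_best = min(range(n), key=lambda j: busy_until[j] + B[i][j])
        A[i][j_best] = 1
        busy_until[j_best] += B[i][j_best]
    return A


# Hypothetical execution-time matrix: 3 tasks, 2 processors.
B = [[2, 3], [1, 4], [2, 2]]
A_seetf = seetf(B)
A_mft = mft(B, order=[0, 1, 2])
```

In this example SEETF places all three tasks on the first processor, whereas MFT moves the third task to the second processor because the first one is already busy.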


7.  Communication  Delays 

As  in  [10],  two  approaches  are  presented  based  on  two 
types  of  interconnection  networks:  (a)  a  high-performance 
network  characterized  by  high-connectivity  and  parallel 
communications  and  (b)  a  bus-oriented  network  with  low- 
connectivity.  In  both  cases,  output  data  is  assumed  to  be 
accumulated  in  a  buffer  during  task  execution  and  trans¬ 
mitted  after  task  completion. 


7.1. Modeling High-performance Communication Networks

High-performance communication networks can be characterized as
expensive systems in which inter-node communication takes place on
dedicated, point-to-point links. Data intended for each successor is
written to a separate buffer. Furthermore, each processor may be
coupled with a front-end communication processor which enables parallel
communication. In terms of a task graph, once a given task completes,
successor tasks experience an initiation delay equal to the data
transfer time for all intended packets; ideally, any successor task
allocated to the same processor as the parent task should be able to
begin execution immediately after the completion of the parent task.

The properties of such a high-performance network can be modeled in a
GSPN by inserting additional place/timed-transition pairs to represent
each individual communication; augmentation of the task graph with
communication nodes has been proposed for CTMC-based analysis [23] and
at the SPN level [24]. Each timed transition inserted is associated
with an exponentially distributed delay whose parameter is the average
communication time between the host processors. Thus, given a completed
task $T_i$ allocated to processor $P_r$ and a successor task $T_j$
allocated to $P_s$, the average communication rate assigned to the
transition modeling the transfer of data is given by:

$$c_{rs} \, pkt_{ij}$$

Fig. 5a illustrates a segment of some task graph in which task A spawns
tasks B, C, and D. Suppose the four tasks are allocated to three
processors such that A and C are allocated to one processor, and B and
D are allocated to the other two processors; then the resulting GSPN
for Case 1 would be as shown in Fig. 5b. Note the insertion of
place/transition pairs between A and B and between A and D to represent
the individual communications.

In terms of simulation, communication delays are determined from a
distribution function using the average delay $\delta_{ij}$ and
associating a local time with communication tasks.


7.2.  Modeling  Bus-Oriented  Networks 

In interconnection networks characterized by low connectivity, groups
of processors may have to share common communication links, as is the
case with a bus-oriented architecture. Also, in lower-cost systems
processors may be forced to expend computation cycles on communication
processing. If, additionally, output data packets for successor tasks
are queued up in a single buffer in some random ordering and
transmitted on a FIFO basis, then it is highly unlikely that a
successor task will receive all of its packets before any other
successor task. In terms of the example in Fig. 5a, if the processor to
which task A is allocated must broadcast packets in random order to the
processors associated with tasks B, C, and D, then it is reasonable to
assume that on average B, C, and D will experience a uniform initiation
delay.

Figure 5. GSPN model assuming a high-performance network: a) segment of a task graph; b) SPN with communication nodes


Such behavior can be reflected in the GSPN by simply modifying the rate
function governing the firing of the transitions associated with each
task. In this case, no extra nodes are inserted in the PN model.
Rather, the firing delay of each transition is increased by the sum of
communication costs associated with each successor task. Let $T_i$ be
allocated to $P_j$, where completion of $T_i$ spawns $m$ tasks which
are allocated to processors $P_{y_1}, \ldots, P_{y_m}$; then a modified
firing rate for transition $t_i$ is given by:

$$\mu_{ij} + \sum_{k=1}^{m} \delta_{j y_k}$$

This new value is then used to determine execution times from the
distribution function of choice with the value of $\mu_{ij}$ determined
accordingly. In reality a given network may be heterogeneous with
respect to interconnection capabilities. In this case the GSPN model
can be systematically constructed to appropriately model each segment
of the network, reflecting the different sets of assumptions mentioned
above. The net result is that the simulation process uses a GSPN
representation with dynamically determined transition rates and
enabling functions capturing the full interplay of task precedence
relationships, allocation specifications, availability of idle
processors, diverse execution rates across a heterogeneous suite, and
communication delays.
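
The adjustment for a bus-oriented network can be sketched as follows. This is an assumption-laden illustration rather than the paper's formula: the function name `bus_adjusted_mean` is hypothetical, and the cost of each transfer is assumed here to be the per-packet time between the two host processors multiplied by the number of packets sent to that successor.

```python
def bus_adjusted_mean(i, alloc, B, C, pkt, succ):
    """Mean firing delay of t_i increased by the communication cost to each successor."""
    j = alloc[i]                         # processor hosting task T_i
    # Assumed cost model: per-packet time between processors times packet count.
    comm = sum(C[j][alloc[h]] * pkt[i][h] for h in succ[i])
    return B[i][j] + comm


# Hypothetical data: T0 on processor 0 sends 4 packets to T1 (processor 1)
# and 2 packets to T2 (processor 0, so no transfer cost).
alloc = [0, 1, 0]
B = [[2.0, 3.0], [1.0, 1.0], [1.0, 1.0]]
C = [[0.0, 0.1], [0.1, 0.0]]
pkt = {0: {1: 4, 2: 2}}
succ = [[1, 2], [], []]
mean_t0 = bus_adjusted_mean(0, alloc, B, C, pkt, succ)
```

With these numbers the mean delay of the first task grows from 2.0 to about 2.4 time units, and no extra places or transitions are needed in the net.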


8.  Simulation  Algorithms 

A simulation algorithm based on the PN-based topological description of
task graphs is now described. The algorithm generates the MTTC and a
tabulation to plot the cumulative probability distribution of the
execution time. The following steps summarize the simulation process
for the case in which no communication delay is taken into account:

1) Initialize the global clock C and the initial marking $M_0$.

2) Check for newly enabled transitions. In the absence of newly
enabled transitions go to step 5).

3) For each enabled timed transition $t_i$ generate a firing time
$x_{ij}$.

4) For each enabled timed transition, set the local clock $lc_i$ to
$lc_i = x_{ij}$.

5) Find the minimum local clock min_t.

6) Fire the transition with the minimum clock min_t. Once a timed
transition fires, the corresponding task completes execution and the
host processor is released.

7) Update the global clock C and the local clocks $lc_i$. Notice that
when the transition with the minimum remaining time min_t fires, its
$lc_i = 0$ and it is removed from the set of $lc_i$'s. The firing of
the last timed transition ends the current cycle. A new cycle begins at
step 1) by resetting the initial marking $M_0$ and the global clock C.

8) Update the marking record and repeat from step 2).
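
The steps above can be sketched as a compact Python simulation for the case without communication delays. This is a simplified illustration under stated assumptions, not the authors' tool: the names `simulate_once` and `mttc` are hypothetical, the task graph is given as successor lists `succ`, `alloc[i]` is the processor of task i, `B[i][j]` is a mean execution time, firing times are exponential, and the auxiliary places and immediate transitions are replaced by simple bookkeeping.

```python
import random


def simulate_once(succ, alloc, B, w, rng):
    """Run one execution cycle and return the completion time C."""
    k = len(succ)
    d = [0] * k                          # in-degree of each task
    for i in range(k):
        for h in succ[i]:
            d[h] += 1
    m = [0] * k                          # tokens in place p_i
    free = [True] * (max(alloc) + 1)     # processor availability
    started = [False] * k
    lc = {}                              # local clocks of running tasks
    C, remaining = 0.0, k
    while remaining:
        # Steps 2-4: enable ready tasks, highest priority first, and set clocks.
        ready = [i for i in range(k) if not started[i] and m[i] == d[i]]
        for i in sorted(ready, key=lambda i: -w[i]):
            if free[alloc[i]]:
                free[alloc[i]] = False
                started[i] = True
                lc[i] = rng.expovariate(1.0 / B[i][alloc[i]])
        # Steps 5-7: fire the transition with the minimum clock, update clocks.
        i_min = min(lc, key=lc.get)
        min_t = lc.pop(i_min)
        C += min_t
        for i in lc:
            lc[i] -= min_t
        free[alloc[i_min]] = True        # host processor is released
        remaining -= 1
        # Step 8: update the marking; each successor place receives a token.
        for h in succ[i_min]:
            m[h] += 1
    return C


def mttc(succ, alloc, B, w, cycles=1000, seed=0):
    """Average completion time over a number of simulated cycles."""
    rng = random.Random(seed)
    return sum(simulate_once(succ, alloc, B, w, rng) for _ in range(cycles)) / cycles


# Hypothetical example: two independent tasks with unit mean execution time.
B = [[1.0, 1.0], [1.0, 1.0]]
serial = mttc([[], []], [0, 0], B, [2, 1], cycles=5000)     # same processor
parallel = mttc([[], []], [0, 1], B, [2, 1], cycles=5000)   # different processors
```

When both tasks share a processor they run one after the other, so the estimate should be close to 2; on separate processors the completion time is the maximum of two exponential delays, whose expected value is 1.5.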

The above procedure is also used for the case of low-performance
networks, where the firing rates are modified accordingly. To take into
account transfer delays in a high-performance network some
modifications are needed. Let $cc_{ih}$ denote the communication clock
between task $T_i$ and task $T_h$. Note that transition $t_i$ is a
transition that has already fired, that is, the corresponding task
$T_i$ is in the process of transferring data. After the transfer is
complete, a token travels to output place $p_h$. The set of
communication clocks $cc_{ih}$ is also compared with the local clocks
$lc_i$ to determine the minimum time min_t. Note that the set of
$lc_i$'s corresponds to transitions $t_i$ that have been enabled but
are not yet transferring data. If the min_t selected corresponds to a
local clock $lc_i$, then transition $t_i$ fires; otherwise, min_t
corresponds to a communication clock and a token is now transferred to
a destination place $p_h$. Steps 1) to 4) are the same and the rest of
the algorithm is modified as follows:

5) Find the minimum clock:

$$min\_t = \min_{i,h} \{lc_i, cc_{ih}\}$$

6) Update the global clock C, the local clocks $lc_i$, and the
communication clocks $cc_{ih}$:

$$C = C + min\_t$$
$$lc_i = lc_i - min\_t$$
$$cc_{ih} = cc_{ih} - min\_t$$

7) If min_t corresponds to a local clock $lc_i$, then:

7.1) Transition $t_i$ fires. Tokens are removed from the input places
and the corresponding processor is released.

7.2) If $t_i$ is the last transition, then stop the cycle.

7.3) Generate communication delays from the average delays
$\delta_{ih}$ and set the communication clocks $cc_{ih}$ accordingly.

8) If min_t corresponds to a communication clock, then transfer a
token to output place $p_h$.

9) Update the marking record and go back to step 2).

9.  Applications 

A hypothetical 13-node task graph is shown in Fig. 6. This graph was
used in [10] to illustrate a PN-based numerical approach to the
solution of complex task graphs. The simulation procedure is applied to
the task graph and compared with the results rendered by the SPNP tool
[18]. The static allocation scheme used maps tasks to processors such
that a task $T_i$ is assigned to processor $P_j$ where
$j = i \bmod n$. The edge weights shown in Fig. 6 correspond to the
number of standard-sized packets generated and sent to successor tasks.

The following matrix B specifies the spectrum of execution times for
each task across the six processing units in the system, in standard
time units per execution:

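$$B = \begin{bmatrix}
.9 & 2 & .3 & 1 & .3 & 4 & 2 & 1 & 3 & 2 & .3 & .2 & .1 \\
.3 & 4 & .3 & 1 & .3 & 5 & 2 & 1 & 3 & 4 & .5 & .5 & .1 \\
.5 & 1 & .5 & 1 & .3 & 5 & 2 & 1 & 4 & 2 & 5 & .2 & .3 \\
.5 & 2 & .2 & 2 & .3 & 5 & 2 & 2 & 2 & 2 & .5 & .2 & .1 \\
.5 & 2 & .3 & 1 & .6 & 5 & 3 & 1 & 3 & 2 & .5 & .2 & .1 \\
.5 & 2 & .3 & 1 & .3 & 7 & 1 & 1 & 3 & 2 & .3 & .2 & .1
\end{bmatrix}$$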

The communication delays per data packet in the interconnection network
between the six processors are characterized by the matrix C in terms
of standard time units per packet:

$$C = \begin{bmatrix}
0 & .1 & .1 & .2 & .2 & .1 \\
.1 & 0 & .4 & .3 & .2 & .1 \\
.1 & .4 & 0 & .2 & .3 & .3 \\
.2 & .3 & .2 & 0 & .3 & .2 \\
.2 & .2 & .3 & .3 & 0 & .1 \\
.1 & .1 & .3 & .2 & .1 & 0
\end{bmatrix}$$

Relative  priorities  among  the  13  tasks  are  specified  thus: 

W = [13 12 11 8 9 10 7 6 5 4 3 1 2]

It should be noted that this priority scheme is entirely arbitrary, as
is the allocation scheme. The numerical and simulation results shown in
Fig. 7 correspond to the probability of completion by time t,
$P(X \le t)$, of the overall job based on three communication
scenarios: a) there are no communication costs, b) communication occurs
over a high-performance network, and c) communication takes place over
a low-performance network. The MTTC results along with confidence
intervals are given in Table 1. Up to 1000 task graphs were simulated,
and rendering the averaged results took about 1.69 secs, compared with
the 125.13 secs needed by the numerical tool (SPNP) on a Sparc Classic
workstation. This difference is in part due to the large number of
states generated. For the case of the low-performance network, SPNP
took 2.40 secs while the simulation process took 0.63 secs [25].
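
As an illustration of how such summary results could be computed from simulated completion times, the sketch below estimates the MTTC, a confidence interval, and one point of the cumulative distribution. It is only an assumption about the post-processing: the function names are hypothetical, the interval uses a normal approximation with z = 2.576 for the 99% level, and the paper does not state which interval construction was actually used.

```python
import math
import statistics


def mttc_with_ci(samples, z=2.576):
    """Mean completion time and a normal-approximation confidence interval."""
    mean = statistics.fmean(samples)
    half = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, (mean - half, mean + half)


def empirical_cdf(samples, t):
    """Estimate P(X <= t) as the fraction of cycles completed by time t."""
    return sum(1 for x in samples if x <= t) / len(samples)


# Hypothetical completion times from five simulated cycles.
times = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, (low, high) = mttc_with_ci(times)
p_at_3 = empirical_cdf(times, 3.0)
```

Evaluating `empirical_cdf` over a grid of t values gives the tabulation needed to plot a curve such as those in Fig. 7.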

A second application consists of evaluating the task graph shown in
Fig. 8. This 20-node task graph describes the LU decomposition
algorithm common in the solution of linear systems encountered in many
scientific applications. Several schedules for different heuristics
were derived in [26]. Two heuristics, Heavy Node First (HNF) and
Weighted Length (WL), were examined to determine the corresponding
assignment matrices.

Figure 7. CDF of completion time given static allocation and network type


Table 1. Comparison of MTTC results

Case                        Numerical MTTC   Simulation MTTC   99% confidence interval
High-Performance Network    18.8269          18.5959           17.9854-19.2063
Low-Performance Network     23.5204          23.5772           22.6119-24.5426
No-communication Costs      14.1999          14.1491           13.5817-14.7165

Both heuristics are based on the execution times (weights) of each
task. The HNF heuristic examines the task graph level by level,
assigning the heaviest nodes first. The WL heuristic assigns control
nodes first by associating a rank determined in terms of the length of
an exit path, branching factor, number of depending tasks in the path,
and their weights. For further details see [26]. The schedules reported
in the form of Gantt charts were derived assuming the following:

1) The processing units are identical,

2) The ratio of processing time to communication time is very high;
consequently, communication delays are assumed negligible, and

3) Execution times are as shown in Fig. 8.

The simulation of these two heuristics under a uniform distribution
with zero variance rendered the same total execution time of 96 units.
Again examining the schedules, the following priority vectors were
obtained:

$W_{HNF}$ = [20 19 16 15 14 13 8 9 12 18 7 3 6 11 17 4 2 5 10 1]

$W_{WL}$ = [20 19 17 16 15 11 8 10 14 18 7 3 6 12 13 4 2 5 9 1]

Thus, any instance of heuristics given in the form of Gantt charts,
such as those derived to compare declustering techniques in [27], can
be similarly characterized for the simulation procedure. Fig. 9 shows
the plots obtained in the evaluation of the SEETF, MFT, LTF, HNF, and
WL heuristics under the assumption of exponentially distributed
execution times. The priority vectors for the first three schemes were
derived as the assignments were made. The results show that HNF and WL
perform better, as expected. Again 1000 copies of the task graph were
used in the simulation.


10. Conclusions

The numerical solution of task graphs based on a GSPN model is limited
to execution times that are exponentially distributed. A reliable
evaluation of large complex task graphs is not guaranteed, as it
involves the solution of a very large underlying state space. One way
to circumvent this problem is to use simulation. The simulation
technique discussed relies on the PN-based topology of a given task
graph. Besides naturally capturing the dynamics of a job execution,
another advantage of relying on a PN-based topology is that a common
model is used for both numerical and simulation-based analysis. This is
useful in the development of a user interface, currently under
construction, that incorporates both methods of solution.

The simulation tool presented facilitates the analysis and comparison
of allocation heuristics. This is illustrated by the evaluation of four
heuristics for a particular application. Results are reported to
compare the behavior of two types of networks, and a comparison is made
between simulation results and those obtained using a numerical
evaluation. The results of these comparisons validate the simulation
tool implemented. It turns out that the simulation algorithm
implemented is faster than the numerical solution for the cases
reported because of the largeness problem. However, an interface
currently under development is required to handle large applications
involving thousands of tasks. Also, the tool can be used to explore and
determine the optimal size of networks in terms of the number of
processors needed to achieve the best performance of a particular
application. Since simulation avoids the problem of state explosion
present in Markov-based models, a useful extension to this tool would
be the analysis of multiple task graphs, for which the use of colored
Petri nets would be more suitable.

Figure 8. Task graph for the LU decomposition algorithm

Abstract: Networks of Workstations (NOW) have become an attractive
alternative platform for high performance computing. Due to the
commodity nature of workstations and interconnects and due to the
multiplicity of vendors and platforms, the NOW environments are being
gradually redefined as Heterogeneous Networks of Workstations (HNOW).
Having an accurate model for the communication in HNOW systems is
crucial for the design and evaluation of efficient communication layers
for such systems. In this paper we present a model for point-to-point
communication in HNOW systems and show how it can be used for
characterizing the performance of different collective communication
operations. In particular, we show how the performance of broadcast,
scatter, and gather operations can be modeled and analyzed. We also
verify the accuracy of our proposed model by using an experimental HNOW
testbed. Furthermore, it is shown how this model can be used for
comparing the performance of different collective communication
algorithms. We also show how the effect of heterogeneity on the
performance of collective communication operations can be predicted.

1  Introduction 

The availability of modern networking technologies [1,
4] and cost-effective commodity computing boxes is shifting
the focus of high performance computing systems towards
Networks of Workstations (NOW). NOW systems consist of
clusters of PCs/workstations connected over Local Area
Networks (LANs) and provide an attractive price to
performance ratio. Many research projects
are  currently  in  progress  to  provide  efficient  communication 
and  synchronization  for  NOW  systems.  However,  most  of 
these projects focus on homogeneous NOWs, systems
comprising similar kinds of PCs/workstations connected over
a  single  network  architecture.  The  inherent  scalability  of 
the  NOW  environment  combined  with  the  commodity  na¬ 
ture  of  the  PCs/workstations  and  networking  equipment  is 
forcing  the  NOW  systems  to  become  heterogeneous  in  na¬ 
ture.  The  heterogeneity  could  be  due  to:  1)  the  difference 
in  processing  and  communication  speeds  of  workstations, 
2)  coexistence  of  diverse  network  interconnects,  or  3)  avail¬ 
ability  of  alternative  communication  protocols.  This  adds 
new  challenges  to  providing  fast  communication  and  syn¬ 
chronization  on  HNOWs  while  exploiting  the  heterogene¬ 
ity. 

The  need  for  a  portable  parallel  programming  environ¬ 
ment  has  resulted  in  development  of  software  tools  like 
PVM  [22]  and  standards  such  as  the  Message  Passing  In¬ 
terface  (MPI)  [14,  20].  MPI  has  become  a  commonly  ac¬ 
cepted  standard  to  write  portable  parallel  programs  using 
the  message-passing  paradigm.  The  MPI  standard  defines  a 
set  of  primitives  for  point-to-point  communication.  In  ad¬ 
dition,  a  rich  set  of  collective  operations  (such  as  broad¬ 
cast,  multicast,  global  reduction,  scatter,  gather,  complete 
exchange  and  barrier  synchronization)  has  been  defined  in 
the  MPI  standard.  Collective  communication  and  synchro¬ 
nization  operations  are  frequently  used  in  parallel  applica¬ 
tions  [3,  5,  7,  13,  15,  18,  21].  Therefore,  it  is  important 
that  these  operations  are  implemented  on  a  given  platform 
in the most efficient manner. Many research projects are
currently focusing on improving the performance of
point-to-point [17, 23] and collective [11, 16] communication
operations on NOWs. However, none of these studies
addresses heterogeneity. The ECO package [12], built on top
of  PVM,  uses  pair-wise  round-trip  latencies  to  character¬ 
ize  a  heterogeneous  environment.  However,  such  latency 
measurements  do  not  show  the  impact  of  heterogeneity  on 
communication  send/receive  overhead  (the  fixed  component 
as well as the variable component depending on the message
length). A detailed model is necessary to develop optimal
algorithms for collective communication operations for
a given HNOW system. The existence of such a model also
makes it possible to predict and evaluate the impact of an
algorithm for a given collective operation on an HNOW system
by simple analytical modeling instead of a detailed simulation.

In  our  preliminary  work  along  this  direction  [2],  we 
demonstrated  that  heterogeneity  in  the  speed  of  worksta¬ 
tions  can  have  significant  impact  on  the  fixed  component 
of  communication  send/receive  overhead.  Using  this  sim¬ 
ple  model,  we  demonstrated  how  near-optimal  algorithms 
for broadcast operations can be developed for HNOW systems.
However, this model does not account for the variable
component of communication overhead or the transmission
component.

In  this  paper,  we  take  on  the  above  challenges  and  de¬ 
velop  a  detailed  communication  model  for  collective  oper¬ 
ations.  First  we  develop  a  model  for  point-to-point  com¬ 
munication.  We  present  a  methodology  to  determine  differ¬ 
ent  components  of  this  model  through  simple  experiments. 
Next,  we  show  how  this  model  can  be  used  for  evaluating 
the  performance  of  various  collective  communication  oper¬ 
ations  based  on  the  configuration  of  the  system  and  the  al¬ 
gorithm  used  for  that  collective  operation.  The  correctness 
of  this  model  is  validated  by  comparing  the  results  predicted 
by  this  model  with  the  results  gathered  from  our  experimen¬ 
tal  testbed  for  different  system  configurations  and  message 
sizes.  Finally,  we  illustrate  how  this  model  can  be  used  by 
algorithm  designers  to  select  strategies  for  developing  opti¬ 
mal  collective  communication  algorithms  and  by  program¬ 
mers  to  study  the  impact  of  a  HNOW  system  configuration 
on  the  performance  of  a  collective  operation. 

This  paper  is  organized  as  follows.  The  communica¬ 
tion  model  for  point-to-point  operations  is  presented  in  Sec¬ 
tion  2.  The  communication  model  for  a  set  of  collective 
operations  and  the  evaluation  of  these  models  are  discussed 
in  Section  3.  In  Section  4,  it  is  shown  how  the  proposed 
model  can  be  used  for  comparing  different  schemes  for  im¬ 
plementing  collective  communication  operations.  It  is  also 
shown  how  we  can  predict  the  performance  of  collective  op¬ 
erations  with  changes  in  system  configuration.  We  conclude 
the  paper  with  future  research  directions. 


2  Modeling Point-to-point Communication

For the characterization of collective communication
operations on heterogeneous networks, knowledge of the send
and receive costs on the various nodes is imperative. In most
implementations of MPI, such as MPICH [8], collective
communication is implemented using a series of
point-to-point messages. Hence, from the characterization of
point-to-point communication on the various types of nodes in a
heterogeneous network, the cost of any collective
communication operation can be estimated. This section explains
the method adopted to determine the send and receive costs
as a function of the message size for point-to-point
communication in a heterogeneous network.

2.1  Characterization of Communication Components

The cost of a point-to-point message transfer consists of
three components, namely, the send overhead, the transmission
cost, and the receive overhead. Each of these components
consists of a message-size-dependent factor and a constant factor.
Thus, the one-way time for a single point-to-point message
between two nodes can be expressed as:

$$T_{ptp} = O_{send} + O_{trans} + O_{receive} \tag{1}$$
$$O_{send} = S_c^{sender} + S_m^{sender} \cdot m \tag{2}$$
$$O_{trans} = X_c + X_m \cdot m \tag{3}$$
$$O_{receive} = R_c^{receiver} + R_m^{receiver} \cdot m \tag{4}$$

where m is the message size (in bytes). The components $S_c$,
$X_c$, and $R_c$ are the constant parts of the send, transmission,
and receive costs, respectively. The components $S_m \cdot m$,
$X_m \cdot m$, and $R_m \cdot m$ are the message-dependent parts.

To obtain an empirical characterization of point-to-point
communication, each term in these equations needs to be
measured. Measuring these terms requires two sets of
experiments: a ping-pong experiment and a consecutive
sends experiment. The time for completing a point-to-point
message between a pair of nodes can be measured using a
ping-pong experiment. In this experiment, one of the nodes
performs a send followed by a receive while the other does
a receive followed by a send. The round-trip time is
determined by averaging over multiple such iterations. If the
nodes involved are identical, the time for a point-to-point
message transfer is equal to half the round-trip time. For
small messages, the message-size-dependent cost can be
ignored compared to the constant costs. Therefore,

$$T_{ptp,small} = \tfrac{1}{2}\, T_{ping\text{-}pong,small} \approx S_c^{sender} + R_c^{receiver} \tag{5}$$


The constant send component of point-to-point messages ($S_c$)
in the above equation can be obtained by measuring the time
for a small number of consecutive sends from one of the
nodes and averaging the time over the number of sends.
Using the measured value of $S_c$ and the measured value of
$T_{ptp,small}$, $R_c$ can be obtained from Equation 5. The
consecutive sends experiment can be repeated for varying
message sizes to calculate the send overhead for those
message sizes. The send overheads can then be plotted as a
function of message size and fitted with a straight line; the
component $S_m$ is equal to the slope of this line.
The transmission cost ($O_{trans}$) can be determined from the
configuration of the underlying network. Then, $R_m$ can be
obtained by taking the difference between the $T_{ptp}$ measured
in the ping-pong experiment and the sum of all other
components in Equation 1.
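The fitting step described above can be sketched in Python as follows. This is a minimal illustration and not the authors' measurement code: the function names, the sample timings, and the use of ordinary least squares for the straight-line fit are assumptions made for the example. The optional x_c argument is also an assumption; it is left at zero so that the constant receive cost follows the two-term relation used in the text.

```python
def linear_fit(sizes, overheads):
    """Least-squares fit of overhead = intercept + slope * size.

    The intercept plays the role of S_c and the slope the role of S_m.
    """
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(overheads) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, overheads))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def receive_constant(t_ptp_small, s_c, x_c=0.0):
    """R_c from the small-message one-way time and the measured S_c."""
    return t_ptp_small - s_c - x_c


# Hypothetical send overheads (microseconds) from a consecutive-sends
# experiment; they were generated from overhead = 40 + 0.05 * size.
sizes = [0, 1024, 2048, 4096]
overheads = [40.0, 91.2, 142.4, 244.8]

s_c, s_m = linear_fit(sizes, overheads)
# Hypothetical half round-trip time (microseconds) for a small message.
r_c = receive_constant(t_ptp_small=100.0, s_c=s_c)
print(round(s_c, 3), round(s_m, 5), round(r_c, 3))
```

The same fit would be repeated once per node type, since each type has its own send and receive parameters.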

2.2  Measurement  of  Communication  Compo¬ 
nents 

In  this  section  we  describe  the  testbed  used  to  verify  our 
proposed  model  and  present  the  measured  values  of  differ¬ 
ent  components  of  point-to-point  communication. 

We  used  two  types  of  personal  computers  in  our  testbed. 
The  first  group  of  nodes  were  Pentium  Pro  200MHz  PCs 
with  128MB  memory  and  16KB/256KB  L1/L2  cache. 
These  machines  are  referred  to  as  slow  nodes  in  the  rest 
of  the  paper.  The  second  group  of  nodes  were  Pentium 
II  300MHz  PCs  with  128MB  memory  and  32KB/256KB 
L1/L2  cache.  We  refer  to  these  machines  as  fast  nodes.  All 
the nodes were connected to a single Fast Ethernet switch
with a peak bandwidth of 100 Mbits/sec and a 5 µsec/port
latency. Thus, the transmission cost ($X_m \cdot m + X_c$) for
our testbed can be expressed as $(m/12.5 + 2 \cdot 5)$ µsec. The
factor of two in this expression reflects the latencies of both
the input and output switch ports in the one-way
communication.
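As a worked illustration of the transmission term, the following Python sketch derives the per-byte cost from a link bandwidth and adds a per-port switch latency for each port crossed. The function name and the numeric arguments in the example call are illustrative assumptions.

```python
def transmission_cost_us(msg_bytes, bandwidth_mbit_s, port_latency_us, ports=2):
    """Return O_trans = X_c + X_m * m in microseconds.

    X_m is the per-byte wire time derived from the link bandwidth and
    X_c is the switch latency summed over the ports a message crosses.
    """
    bytes_per_us = bandwidth_mbit_s / 8.0  # 1 Mbit/s moves 0.125 bytes per microsecond
    x_c = ports * port_latency_us
    return x_c + msg_bytes / bytes_per_us


# Illustrative call: a 1250-byte message on a 100 Mbit/s link with an
# assumed latency of 5 microseconds per switch port.
example_cost = transmission_cost_us(1250, 100.0, 5.0)
print(example_cost)
```

At 100 Mbits/sec the link moves 12.5 bytes per microsecond, which is where the m/12.5 term comes from.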

We  performed  the  experiments  described  in  Section  2.1 
for  both  slow  and  fast  nodes.  The  components  correspond¬ 
ing  to  the  slow  and  fast  nodes  are  shown  in  Table  1. 


Table  1.  Send  and  Receive  parameters. 


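[Table body not recoverable from the source scan. Legible fragments: a column header "S_c (µsec)" and a row labeled "slow" containing the values 0.18 and 0.08.]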

Observation  1  The  send  overhead  and  receive  overhead 
can be different in a heterogeneous environment. Using the
same  value  for  both  overheads  or  just  considering  the  send 
overhead  may  result  in  inaccurate  models. 

We  also  verified  these  values  by  measuring  the  one  way 
latency  of  messages  between  different  types  of  nodes.  Ta¬ 
ble  2  illustrates  the  latency  (of  zero-byte  messages)  between 
fast  and  slow  nodes.  The  experimental  results  correspond¬ 
ing  to  one-way  latencies  agree  with  those  calculated  from 


Equations  1  through  4.  It  should  be  noted  that  the  same 
set  of  experiments  can  be  used  for  systems  with  multiple 
types  of  computing  nodes.  In  general,  for  a  system  with 
n  different  types  of  computers,  the  experiments  should  be 
performed  n  times  (once  for  each  type). 


Table  2.  One-way  latency  (microsec)  between 
different  types  of  nodes  in  our  testbed. 


Sender Type    Receiver Type    Time (µsec)
Fast           Fast             170
Fast           Slow             200
Slow           Fast             (illegible)
Slow           Slow             (illegible)

In  heterogeneous  environments,  there  is  a  possibility  of 
having  different  types  of  machines  at  the  sending  and  re¬ 
ceiving  sides  of  a  point-to-point  communication.  Thus, 
using a simple model for communication cost such as
$T_{ptp} = O_{constant} + O_{per-byte} \cdot m$ with the same
$O_{constant}$ and $O_{per-byte}$ for all pairs of nodes will not
be sufficient.
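The point of this section can be made concrete with a short Python sketch of the point-to-point model, in which the send parameters come from the sender and the receive parameters from the receiver. The class name, function name, and every numeric value below are hypothetical and chosen only to exercise the formulas; they are not the measured testbed parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeParams:
    s_c: float  # constant send overhead (microseconds)
    s_m: float  # per-byte send overhead (microseconds/byte)
    r_c: float  # constant receive overhead (microseconds)
    r_m: float  # per-byte receive overhead (microseconds/byte)


def ptp_time(sender, receiver, msg_bytes, x_c, x_m):
    """One-way time: send overhead + transmission cost + receive overhead."""
    o_send = sender.s_c + sender.s_m * msg_bytes
    o_trans = x_c + x_m * msg_bytes
    o_receive = receiver.r_c + receiver.r_m * msg_bytes
    return o_send + o_trans + o_receive


# Hypothetical parameters for two node types.
FAST = NodeParams(s_c=50.0, s_m=0.05, r_c=100.0, r_m=0.10)
SLOW = NodeParams(s_c=80.0, s_m=0.10, r_c=150.0, r_m=0.20)

for s_name, s in (("fast", FAST), ("slow", SLOW)):
    for r_name, r in (("fast", FAST), ("slow", SLOW)):
        print(s_name, "->", r_name, ptp_time(s, r, 0, x_c=10.0, x_m=0.08))
```

With these values the fast-to-slow and slow-to-fast times differ, which a single pair of constants shared by all node pairs could not express.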

3  Modeling  Collective  Communication  Oper¬ 
ations 

The  MPI  standard  provides  a  rich  set  of  collective  com¬ 
munication  operations.  Different  implementations  of  MPI 
use  different  algorithms  for  implementing  these  operations. 
However, most of the implementations, especially those used
for NOW environments, implement the different collective
operations on top of point-to-point operations. Different
types of trees (e.g., binomial trees, sequential trees,
and k-trees) can be used for implementing these operations
[6, 9, 10, 19]. We can classify the MPI collective
communication operations into three major categories:
one-to-many (such as MPI_Bcast and MPI_Scatter), many-to-one
(such as MPI_Gather), and many-to-many (such as
MPI_Allgather and MPI_Alltoall). We present the analysis
for  some  of  the  representative  operations. 

In  the  following  sections,  we  provide  analytical  models 
for  the  broadcast,  scatter,  and  gather  operations  as  they  are 
implemented  in  MPICH.  We  also  verify  the  accuracy  of  our 
analytical  models  by  comparing  the  estimated  times  with 
measured  times  on  the  experimental  testbed.  It  should  be 
noted  that  our  model  does  not  consider  the  effect  of  con¬ 
tention  and  is  applicable  only  to  fully-connected  systems. 

3.1  Broadcast  and  Multicast 

Binomial trees are used by MPICH for implementing
broadcast and multicast. Figure 1a illustrates how broadcast
is performed on a four-node system.

(a) Broadcast Tree  (b) Scatter Tree  (c) Gather Tree

Figure 1. Trees used in MPICH for the broadcast,
scatter, and gather operations. The numbers inside the tree
nodes indicate the rank of the computing nodes assigned to
those tree nodes.

The completion time of broadcast on an n-node system can
be modeled by using the following expression:

$$T_{broadcast} = \max\{T_{recv}^{0}, T_{recv}^{1}, \ldots, T_{recv}^{n-1}\} \tag{6}$$

where $T_{recv}^{i}$ is the time when node i receives the entire
message. The value of $T_{recv}$ for the root of broadcast is
obviously zero, and $T_{recv}$ for all other nodes can be calculated
from the following recurrence:

$$T_{recv}^{i} = T_{recv}^{parent(i)} + childrank(parent(i), i) \cdot (S_c^{parent(i)} + S_m^{parent(i)} \cdot msg\_size) + X_m \cdot msg\_size + X_c + R_m^{i} \cdot msg\_size + R_c^{i} \tag{7}$$

where parent(i) indicates the parent of node i in the broadcast
tree and childrank(parent(i), i) is the order, among
its siblings, in which node i receives the message from its
parent.
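The recurrence can be evaluated by walking the tree from the root, as in the Python sketch below. Each node's receive time is its parent's receive time, plus one send overhead of the parent per message the parent has issued up to and including this one, plus the transmission cost and the node's own receive overhead. The function name, the tree layout, the assumption that child ranks start at 1, and all parameter values are hypothetical.

```python
def broadcast_time(children, params, root, msg_bytes, x_c, x_m):
    """Estimate broadcast completion time over an explicit tree.

    children: dict mapping a node to the ordered list of nodes it sends to.
    params:   dict mapping a node to a tuple (s_c, s_m, r_c, r_m).
    Returns (completion_time, per-node receive times).
    """
    t_recv = {root: 0.0}
    pending = [root]
    while pending:
        parent = pending.pop()
        s_c, s_m, _, _ = params[parent]
        send = s_c + s_m * msg_bytes
        # Child ranks are assumed to start at 1 for the first child served.
        for rank, child in enumerate(children.get(parent, []), start=1):
            _, _, r_c, r_m = params[child]
            t_recv[child] = (t_recv[parent] + rank * send
                             + x_m * msg_bytes + x_c
                             + r_m * msg_bytes + r_c)
            pending.append(child)
    return max(t_recv.values()), t_recv


FAST = (50.0, 0.05, 100.0, 0.10)   # hypothetical (s_c, s_m, r_c, r_m)
SLOW = (80.0, 0.10, 150.0, 0.20)
tree = {0: [2, 1], 2: [3]}         # one possible four-node binomial tree

config_b = {0: FAST, 1: FAST, 2: SLOW, 3: SLOW}
config_d = {0: FAST, 1: SLOW, 2: FAST, 3: SLOW}
time_b, _ = broadcast_time(tree, config_b, 0, 0, x_c=10.0, x_m=0.08)
time_d, _ = broadcast_time(tree, config_d, 0, 0, x_c=10.0, x_m=0.08)
print(time_b, time_d)
```

Both configurations use two fast and two slow nodes, yet the estimates differ because a slow interior node delays everything below it in the tree.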

This  model  can  be  used  to  estimate  the  completion  time 
of  a  broadcast  operation  on  a  given  heterogeneous  system. 
In order to verify the accuracy of this model, we also
measured the completion time of MPI_Bcast on our
experimental testbed. The procedure in Figure 2 was used for
measuring the completion time of MPI_Bcast (or other MPI
collective operations). In order to minimize the effect of external
factors,  this  procedure  was  executed  several  times  and  the 
minimum  of  the  measured  times  is  reported. 

The  comparison  between  the  measured  and  estimated 
completion  times  of  broadcast  on  a  four-node  system  with 
four  different  configurations  is  shown  in  Figure  3.  It  can 
be  observed  that  the  estimated  times  using  our  analytical 
models  are  very  close  to  the  measured  times.  It  is  also  in¬ 
teresting  to  note  that  the  completion  time  of  broadcast  for  a 
system  with  two  fast  nodes  and  two  slow  nodes  varies  with 
the  order  of  the  nodes  in  the  binomial  tree  (configurations  B 
and  D). 


MPI_Barrier
get start_time
for (i = 0; i < ITER; i++) {
    MPI_Barrier
}
get end_time
barrier_time = (end_time - start_time) / ITER

MPI_Barrier
get start_time
for (i = 0; i < ITER; i++) {
    MPI_Bcast    /* or any other collective operation */
    MPI_Barrier
}
get end_time
local_time = (end_time - start_time) / ITER
global_time = reduce(local_time, MAXIMUM)
broadcast_time = global_time - barrier_time

Figure  2.  Outline  of  the  procedure  used  for 
measuring  the  completion  time  of  broadcast 
and  other  collective  operations. 


Observation  2  The  completion  time  for  a  given  broadcast 
tree  depends  not  only  on  the  type  of  participating  nodes,  but 
also  on  how  these  nodes  are  assigned  to  the  broadcast  tree 
nodes. 


3.2  Scatter 


Scatter is another one-to-many operation defined in the
MPI standard. Sequential trees are used to implement
scatter in MPICH¹. Figure 1b illustrates how scatter is
implemented on a four-node system. The completion time of
scatter on an n-node system can be modeled as follows:

$$T_{scatter} = \max\{T_{recv}^{0}, T_{recv}^{1}, \ldots, T_{recv}^{n-1}\} \tag{8}$$

$$T_{recv}^{i} = childrank(root, i) \cdot (S_c^{parent(i)} + S_m^{parent(i)} \cdot msg\_size) + X_m \cdot msg\_size + X_c + R_m^{i} \cdot msg\_size + R_c^{i} \tag{9}$$


Figure  4  compares  the  estimated  completion  times 
(based  on  Equation  8)  of  the  scatter  operation  with  those 
measured  on  the  experimental  testbed  with  four  nodes  and 
different  configurations.  It  can  be  observed  that  the  esti¬ 
mated  times  (using  our  analytical  models)  are  very  close  to 
the  measured  times. 
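For a sequential tree every destination is a direct child of the root, so the scatter estimate needs no recursion. The Python sketch below is one way to evaluate it: the ith destination waits for i send overheads at the root, then pays the transmission cost and its own receive overhead. The function name, the send order, and the parameter values are hypothetical, and the root's message to itself is ignored, as in the text.

```python
def scatter_time(order, params, root, msg_bytes, x_c, x_m):
    """Estimate scatter completion time over a sequential tree.

    order:  destinations in the order the root sends to them.
    params: dict mapping a node to a tuple (s_c, s_m, r_c, r_m).
    msg_bytes is the size of the piece sent to each destination.
    """
    s_c, s_m, _, _ = params[root]
    send = s_c + s_m * msg_bytes
    t_recv = {root: 0.0}
    for rank, node in enumerate(order, start=1):
        _, _, r_c, r_m = params[node]
        t_recv[node] = (rank * send
                        + x_m * msg_bytes + x_c
                        + r_m * msg_bytes + r_c)
    return max(t_recv.values()), t_recv


FAST = (50.0, 0.05, 100.0, 0.10)   # hypothetical (s_c, s_m, r_c, r_m)
SLOW = (80.0, 0.10, 150.0, 0.20)

config_b = {0: FAST, 1: FAST, 2: SLOW, 3: SLOW}
config_c = {0: SLOW, 1: FAST, 2: SLOW, 3: FAST}
scatter_b, _ = scatter_time([1, 2, 3], config_b, 0, 1000, x_c=10.0, x_m=0.08)
scatter_c, _ = scatter_time([1, 2, 3], config_c, 0, 1000, x_c=10.0, x_m=0.08)
print(scatter_b, scatter_c)
```

Because the root issues every send, the root's node type sets the per-destination cost, and the last destination served usually determines the completion time.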


¹It should be noted that in the scatter operation, as implemented in
MPICH,  a  message  is  sent  from  the  root  to  itself  through  point-to-point 
communication.  Since  the  time  for  this  operation  is  small  in  comparison 
with  the  total  completion  time  of  scatter  we  ignore  it  in  our  analysis.  We 
use  the  same  approach  when  we  model  the  gather  operation. 
Figure 3. Estimated and measured completion times for the broadcast operation on four nodes with
four different configurations. The types of the participating nodes from node 0 to node 3 (in MPI
ranking) are: A: fast, fast, fast, fast; B: fast, fast, slow, slow; C: slow, fast, slow, fast; D: fast, slow,
fast, slow.


Figure  4.  Estimated  and  measured  completion  times  for  the  scatter  operation  on  four  nodes  with 
three  different  configurations.  The  types  of  the  participating  nodes  from  node  0  to  node  3  (in  MPI 
ranking)  are:  A:  fast,  fast,  fast,  fast;  B:  fast,  fast,  slow,  slow;  C:  slow,  fast,  slow,  fast. 
3.3  Gather

Gather is a many-to-one collective operation implemented
in MPICH by using reverse sequential trees
(Fig. 1c). The completion time of this operation can be
modeled by using the following expression:


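$$T_{gather} = T_{receive}^{n-1} \tag{10}$$

$$T_{receive}^{i} = \max\{T_{receive}^{i-1}, T_{arrive}^{i}\} + R_c^{root} + R_m^{root} \cdot msg\_size \tag{11}$$

$$T_{arrive}^{i} = (\text{the } i\text{th smallest element of } \{S_c^{j} + S_m^{j} \cdot msg\_size\} \text{ for } 1 \le j \le n-1) + i \cdot (X_m \cdot msg\_size + X_c) \tag{12}$$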

where $T_{receive}^{i}$ is the time by which the messages from i
nodes (out of the n - 1 sender nodes) have been received
at the root node, and $T_{arrive}^{i}$ is the time when the ith
message has arrived at the root node.

The  estimated  completion  times  using  our  models  and 
the  measured  completion  times  for  a  four-node  system  with 
different  configurations  are  shown  in  Figure  5.  It  can  be 
seen  that  the  estimated  times  are  close  to  the  measured 
times. 
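A Python sketch of the gather estimate follows. Arrival times are computed as described above: the sorted send overheads of the non-root nodes, plus i transmission costs for the ith arrival. How the root's receive overhead enters the recurrence is an assumption of this sketch: each message is taken to be received after both its arrival and the completion of the previous receive, and then to cost one receive overhead at the root. The function name and all parameter values are hypothetical.

```python
def gather_time(senders, params, root, msg_bytes, x_c, x_m):
    """Estimate gather completion time at the root.

    senders: the non-root nodes.
    params:  dict mapping a node to a tuple (s_c, s_m, r_c, r_m).
    """
    send_done = sorted(params[j][0] + params[j][1] * msg_bytes for j in senders)
    trans = x_m * msg_bytes + x_c
    _, _, r_c, r_m = params[root]
    recv = r_c + r_m * msg_bytes
    t_receive = 0.0
    for i, sent in enumerate(send_done, start=1):
        t_arrive = sent + i * trans
        # Assumed recurrence: wait for arrival and for the previous
        # receive to finish, then pay the root's receive overhead.
        t_receive = max(t_receive, t_arrive) + recv
    return t_receive


FAST = (50.0, 0.05, 100.0, 0.10)   # hypothetical (s_c, s_m, r_c, r_m)
SLOW = (80.0, 0.10, 150.0, 0.20)

config_a = {0: FAST, 1: FAST, 2: FAST, 3: FAST}
config_b = {0: FAST, 1: FAST, 2: SLOW, 3: SLOW}
config_c = {0: SLOW, 1: FAST, 2: SLOW, 3: FAST}
gather_a = gather_time([1, 2, 3], config_a, 0, 0, x_c=10.0, x_m=0.08)
gather_b = gather_time([1, 2, 3], config_b, 0, 0, x_c=10.0, x_m=0.08)
gather_c = gather_time([1, 2, 3], config_c, 0, 0, x_c=10.0, x_m=0.08)
print(gather_a, gather_b, gather_c)
```

With these hypothetical values, configurations A and B give the same estimate because the root's receive overhead is the bottleneck, while a slow root (configuration C) raises it.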

4  Applying  the  Communication  Model 

In  this  section,  we  explain  how  the  models  developed  in 
the  previous  sections  can  be  used  for  evaluating  different 
collective  communication  algorithms.  We  also  discuss  how 
these  models  can  be  used  to  characterize  the  effect  of  the 
heterogeneous  configuration  (the  number  of  nodes  from  dif¬ 
ferent  types).  Without  loss  of  generality,  we  consider  sys¬ 
tems  with  two  different  types  of  computing  nodes. 

4.1  Choice of Algorithms

Collective  communication  operations  can  be  imple¬ 
mented  by  using  different  algorithms  (which  use  different 
types  of  trees).  For  instance,  MPICH  and  many  other  com¬ 
munication  libraries  use  the  binomial  trees  for  broadcast. 
However,  it  has  been  shown  that  binomial  trees  are  not  the 
best  choice  for  all  systems  [2].  Therefore,  in  order  to  find 
the  best  scheme  for  a  given  collective  operation,  it  is  impor¬ 
tant  to  compare  the  performance  of  different  schemes.  Our 
proposed  communication  model  can  be  used  to  evaluate  the 
performance  of  these  algorithms  analytically. 

Consider  an  8-node  system  (four  fast  nodes  and  four 
slow  nodes  with  characteristics  described  in  Section  2.2) 
and  three  different  trees  (as  shown  in  Fig.  6)  for  the  broad¬ 
cast  operation.  The  estimated  completion  times  of  broad¬ 
cast  on  this  system  using  different  tree  structures  are  shown 


in Figure 7a. It can be seen that for the given configuration,
the binomial tree performs worse than the other trees.
For smaller messages, the hierarchical tree performs better
than the others. For messages larger than 4K bytes, the
sequential tree has the best performance. Figure 7b shows
the results for a 16-node system. It can be seen that the
hierarchical algorithm outperforms the other two algorithms
for the 16-node system.

The  models  presented  in  this  paper  can  be  used  to  eval¬ 
uate  the  performance  of  different  collective  communication 
operations  in  heterogeneous  environments.  These  models 
can  be  used  for  comparing  the  performance  of  different  al¬ 
gorithms  for  different  collective  operations  and  identifying 
the  most  efficient  algorithms. 

4.2  Effect  of  Configuration  on  Performance 

The  proposed  communication  models  can  also  be  used 
for predicting the effect of configurations in HNOW systems.
For instance, consider a system with a set of slow and
fast  nodes.  We  are  interested  in  knowing  how  the  perfor¬ 
mance  of  collective  communication  operations  varies  based 
on  the  number  of  fast/slow  nodes  in  the  system.  Figure  8 
illustrates  the  estimated  completion  times  of  the  gather  op¬ 
eration  for  varying  number  of  fast  nodes  in  systems  with  8 
and  16  nodes,  respectively.  (The  root  is  always  considered 
to  be  a  fast  node.)  It  can  be  observed  that  having  more  than 
four  fast  nodes  in  these  systems  does  not  improve  the  per¬ 
formance  of  this  operation  significantly.  The  models  pre¬ 
sented  in  this  paper  can  be  used  to  evaluate  the  performance 
of  other  collective  communication  operations  and  other  sys¬ 
tem  configurations. 

5  Conclusions  and  Future  Work 

In  this  paper,  we  have  proposed  a  new  model  for  estimat¬ 
ing  the  cost  of  point-to-point  and  different  collective  oper¬ 
ations  in  the  emerging  HNOW  systems.  We  have  verified 
the  validity  of  our  models  by  using  an  experimental  het¬ 
erogeneous  testbed.  In  addition,  we  have  shown  how  this 
model  can  be  used  to  compare  different  algorithms  for  dif¬ 
ferent  collective  operations.  We  have  also  shown  that  this 
model  can  be  used  to  predict  the  effect  of  having  different 
types  of  computing  nodes  on  the  performance  of  collective 
operations. 

We  plan  to  evaluate  our  model  by  using  other  promis¬ 
ing  networking  technologies  (such  as  Myrinet  and  ATM) 
and  underlying  point-to-point  communication  layers  (such 
as  FM  and  U-Net).  We  also  plan  to  extend  our  model  for 
systems  with  heterogeneous  networking  technologies.  We 
are exploring how this model can be used to predict the
execution time of parallel applications on heterogeneous
systems. By doing so, we will be able to consider the effect
of communication time on the overall execution time of an
application more accurately and, based on that, come up with
better load balancing schemes.

Figure 5. Estimated and measured completion times for the gather operation on four nodes with three
different configurations. The types of the participating nodes from node 0 to node 3 (in MPI ranking)
are: A: fast, fast, fast, fast; B: fast, fast, slow, slow; C: slow, fast, slow, fast.


(c) Sequential Tree


Figure 6. Three different possible trees for implementing broadcast in a HNOW system with eight
nodes (F = fast node, S = slow node).


(a) 8-node system  (b) 16-node system


Figure 7. Comparing the performance of three different broadcast algorithms (based on the respective
tree structures of Figure 6) on 8-node and 16-node heterogeneous systems.


(a) 8-node system


Figure 8. Comparison between the completion times of the gather operation on 8-node and 16-node
systems with different number of fast and slow nodes.
Abstract

A  framework  for  task  assignment  in  heterogeneous  com¬ 
puting  systems  is  presented  in  this  work.  The  framework  is 
based  on  a  learning  automata  model.  The  proposed  model 
can  be  used  for  dynamic  task  assignment  and  scheduling 
and  can  adapt  itself  to  changes  in  the  hardware  or  network 
environment.  The  important  feature  of  the  scheme  is  that 
it  can  work  on  multiple  cost  criteria,  optimizing  each  cri¬ 
terion  individually.  The  cost  criterion  could  be  a  general 
metric like minimizing the total execution time, or an
application-specific metric defined by the user. The application
task is modeled as a task flow graph (TFG), and the
network of machines as a processor graph (PG). The automata
model is constructed by associating every task in the TFG
with a variable structure learning automaton [1]. The
actions of each automaton correspond to the nodes in the PG.
The reinforcement scheme of the automaton considered here
is a linear scheme. Different heuristic techniques that guide
the automata model to the optimal solution are presented.
These heuristics are evaluated with respect to different cost
metrics.

Key  words:  Task  assignment,  variable  structure  learning 
automata,  multiple  cost  criteria,  stochastic  optimization, 
task  flow  graph  and  processor  graph. 


1.  Introduction 

Heterogeneous computing (HC) [2, 8, 15] is the tuned
use of diverse processing hardware to meet distinct
computational needs. An HC environment consists of a
heterogeneous suite of machines and high speed interconnections,
which can be used effectively to execute different portions
of an application. An application is usually represented as
a task flow graph (TFG), which is a directed acyclic graph
showing the data dependencies between the various tasks of
the  application.  An  important  problem  in  this  domain  is  the 
task  assignment  problem,  which  corresponds  to  the  assign¬ 
ment  of  tasks  to  the  machines  in  the  HC  suite  such  that  a 
specified  cost  criterion  is  optimized.  It  is  well  known  that 
the  assignment  problem  in  general  is  NP-complete  [6]. 

In  the  literature,  a  variety  of  mathematical  formula¬ 
tions  have  been  developed  for  the  task  assignment  prob¬ 
lem.  These  formulations  are  collectively  called  selection 
theory  and  try  to  choose  the  appropriate  machine  for  each 
subtask  of  the  TFG  [6, 7, 9, 10, 4].  Approaches  to  the  prob¬ 
lem  based  on  graph  theoretic  techniques  [5,  4],  simulated 
annealing  [14],  exhaustive  state  space  search  [16],  and  ge¬ 
netic  techniques  [12,  14,  11]  have  been  proposed.  In  this 
work,  the  authors  propose  a  new  learning  automata  model 
for the assignment problem, which is based on a
variable-structure learning automaton [1]. In [3], a stochastic
learning automaton model was proposed for scheduling in
distributed systems. The systems were assumed to be
homogeneous, and the model had a large number of automaton
states, $2^M$ for an M-node network. The model proposed here
associates an automaton with every task in the TFG, and hence
there are only M states for an M-node network. The different
reinforcement schemes of the automata define the behavior of
the assignment algorithm. The key feature of the
proposed model is its ability to optimize multiple cost metrics.
In other works, multimetric optimization is achieved
by optimizing the weighted sum of the various metrics. In
the proposed model, the cost metrics are optimized
independently of each other, each subject to its associated
weight. The cost criterion itself could be a general metric like
minimizing  the  total  execution  time,  or  it  could  be  defined 
specifically  to  suit  the  needs  of  an  application. 

The  paper  is  organized  as  follows.  Section  2  introduces 
the  framework,  followed  by  a  description  of  the  HC  system 
model  and  the  cost  criteria  in  section  3.  A  construction  of 
the  automata  model  and  the  heuristic  techniques  proposed 
for the model are presented in the next section. The subject
of  section  5  is  the  performance  analysis  of  these  techniques. 
The  last  section  provides  the  conclusions  and  scope  for  fu¬ 
ture  work. 

2.  Model  of  the  Framework 

This  section  presents  a  general  model  of  the  proposed 
framework  for  task  assignment  in  the  HC  system.  Fig¬ 
ure  1  depicts  a  schematic  representation  of  the  framework. 
The  environment  consists  of  the  heterogeneous  suite  of  ma¬ 
chines  which  will  be  used  to  execute  the  application.  The 
scheduling  system  consists  of  the  proposed  learning  au¬ 
tomata  model  and  a  model  of  the  application  and  HC  sys¬ 
tem.  These  models  are  used  by  the  scheduler  to  assign  the 
subtasks  to  the  different  machines. 


Scheduling  System 


Figure  1.  Model  of  the  Proposed  Framework 


The  HC  system  environment  consists  of  a  heteroge¬ 
neous  suite  of  machines  interconnected  in  a  specific  topol¬ 
ogy.  These  machines  could  include  sequential  processors, 
MIMD  systems,  SIMD  systems,  or  special  purpose  archi¬ 
tectures  and  DSP  processors.  All  these  various  systems 
are  assumed  to  be  interconnected  by  a  high  speed  inter¬ 
connection  network.  The  application  task  is  assumed  to  be 
represented as a task flow graph (TFG). The TFG is a directed
acyclic graph which shows the dependencies of the
subtasks. The interconnection of the machines in the HC
system is represented similarly by means of a processor
graph (PG). The objective of the scheduling system is to assign
the subtasks of the TFG to the processors in the PG,
so that the defined set of cost criteria is optimized. This
is achieved by means of the automata model proposed in this
paper. The model is built on a variable structure learning
automaton  and  uses  an  iterative  algorithm.  The  external  en¬ 
vironment  for  this  automaton  model  is  the  system  model 
formed  by  the  TFG  and  the  PG.  Once  the  scheduling  sys¬ 
tem  determines  on  which  processors  the  various  tasks  are 
to  be  executed,  the  actual  execution  on  the  HC  system  can 
take  place. 


The advantages of the proposed framework are as follows.
Since the actual task assignment and scheduling
is based on a model of the application (TFG) and the
machines of the HC system (PG), the framework can be used
for dynamic task assignment and scheduling. The proposed
learning automaton model optimizes the set of cost criteria
based on a given PG. Hence, whenever there is a change
in the machine or network configuration, the model will
generate a solution that is optimal for the transformed system
configuration. The important feature of the proposed
framework is the incorporation of multiple cost criteria, all
of which are optimized simultaneously. The following sections
explain the construction of the framework and highlight
its advantages.

3. HC System Model

This  section  deals  with  the  modeling  of  the  application 
and  the  hardware  configuration  of  the  HC  system.  Certain 
assumptions  that  have  been  made  are  introduced  first.  The 
later  part  of  the  section  discusses  how  the  different  cost  met¬ 
rics  for  the  system  can  be  defined.  The  task  assignment  al¬ 
gorithm  will  generate  a  solution  that  tries  to  optimize  these 
cost  criteria. 

3.1.  Assumptions 

1. The application program is assumed to be decomposed
into multiple tasks. The data dependencies between the
tasks are given by a Task Flow Graph (TFG), which is a
directed acyclic graph.

2. The HC system is assumed to consist of a set of heterogeneous
machines, which communicate by means of an
underlying interconnection network. This is represented by
a Processor Graph (PG).

3. The expected execution time of each task on each machine
in the HC suite is known a priori. These execution
times can be obtained by task profiling and analytical
benchmarking techniques.

4. The cost of communicating a single unit of data between
any two machines is also known a priori. If any pair of machines
in the HC suite cannot communicate with each other,
then the cost of communication between them is assumed to
be $\infty$.

3.2.  Cost  Metric 

The TFG contains a set of tasks $S$, where $s_i$ is the $i$th task in
$S$. The set of machines in the HC environment is given by
$M$, where $m_j$ is the $j$th machine. $C$ denotes the set of cost
metrics defined for the application. The assignment problem
then corresponds to a mapping $\pi$ from the set $S$ to the
set $M$, such that the metrics in $C$ are optimized.

$S = \{s_i,\ 0 \le i < |S|\}$

$M = \{m_j,\ 0 \le j < |M|\}$

$C = \{c_k,\ 0 < k \le |C|\}$, where $c_k$ is a specific cost metric

$\pi : S \rightarrow M$

The total execution time is the sum of the total computation
time and the total communication time for the task assignment
at an iteration. For this cost metric, whose two terms are defined
under "Minimize the total execution time" below, $c_2(n)$ is:

$c_2(n) = c_p(n) + c_c(n)$


A solution vector $X(n)$ of order $|S| \times |M|$ gives an instance
of the mapping $\pi$ at iteration $n$. Each element of the
vector, $x_i(n)$, indicates the machine to which task $s_i$
is assigned at iteration $n$. The cost of executing
a task on a machine, the communication time between
the machines, the number of data units exchanged between
tasks, and the amount of power consumed by executing a
task on a machine are all given by the following matrices.

$X(n)$ -> Solution vector of order $|S| \times |M|$

$EX$ -> Execution cost matrix of order $|S| \times |M|$

$CC$ -> Communication cost matrix of order $|M| \times |M|$

$DX$ -> Data exchange matrix of order $|S| \times |S|$

$PW$ -> Power cost matrix of order $|S| \times |M|$
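
As an illustration only, the four cost matrices could be held in a small container such as the one below. This is a sketch under stated assumptions: the class name `HCModel`, its field names, and the `is_assignment` helper are hypothetical and do not appear in the paper, and the solution is assumed to be stored as a list with one machine index per task.

```python
from dataclasses import dataclass
from typing import List

INF = float("inf")  # assumed marker for machine pairs that cannot communicate


@dataclass
class HCModel:
    """Hypothetical container for the cost matrices (names are illustrative)."""
    ex: List[List[float]]  # ex[i][j]: execution cost of task i on machine j
    cc: List[List[float]]  # cc[j][l]: cost per data unit between machines j and l
    dx: List[List[float]]  # dx[i][k]: data units sent from task i to task k
    pw: List[List[float]]  # pw[i][j]: power cost of task i on machine j

    def __post_init__(self):
        n_tasks, n_machines = len(self.ex), len(self.cc)
        # Check that every matrix has the order stated in the text.
        ok = (
            all(len(row) == n_machines for row in self.ex)
            and all(len(row) == n_machines for row in self.cc)
            and len(self.dx) == n_tasks
            and all(len(row) == n_tasks for row in self.dx)
            and len(self.pw) == n_tasks
            and all(len(row) == n_machines for row in self.pw)
        )
        if not ok:
            raise ValueError("matrix orders do not match |S| and |M|")

    def is_assignment(self, x: List[int]) -> bool:
        """True if x holds exactly one valid machine index per task."""
        return len(x) == len(self.ex) and all(0 <= j < len(self.cc) for j in x)
```

Storing the solution as one machine index per task keeps the lookups `ex[i][x[i]]` and `cc[x[i]][x[k]]` direct, which matches how the cost metrics index these matrices.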

The  next  part  in  formulating  the  problem  involves  the 
actual  definition  of  the  cost  metrics.  There  are  a  number  of 
such metrics that can be defined by the user based on the application.
For instance, the user might want to minimize the
total execution time, or minimize the load on the most heavily
loaded processor, or minimize the amount of power consumed. There
may also be other metrics that are specific to a particular
application. These may be defined with the help of the matrices
defined above. Each of these metrics is defined at an
iteration $n$ as $c_k(n)$. Three of these metrics are defined
here:


Minimize the load on the maximum loaded processor:

Let the total load on a particular machine, say $m_j$, at iteration
$n$ be represented as $l_j(n)$. Therefore

$l_j(n) = \sum_{i=0,\ x_i(n)=j}^{|S|-1} EX_{ij} + \sum_{i=0,\ x_i(n)=j}^{|S|-1} \sum_{k=i+1}^{|S|-1} DX_{ik} * CC_{x_i(n) x_k(n)}$

In the equation for $l_j(n)$, the first part represents the total
computation cost on the processor $m_j$ and the second part
corresponds to the total communication cost incurred by the
processor at iteration $n$. The load on every processor needs
to be computed in order to identify the heaviest loaded processor.
The optimal assignment for the problem would be
that which results in minimizing the load on the heaviest
loaded processor. Therefore, the cost metric $c_1(n)$ would
be:

$c_1(n) = \max_j\ l_j(n)$ for $j = 0$ to $|M| - 1$
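
The load computation can be sketched in Python as follows. This is an illustrative sketch, not the authors' code. It assumes that the solution is a list `x` with `x[i]` the machine index of task `i`, that `ex`, `dx` and `cc` are nested lists holding the execution cost, data exchange and communication cost matrices, and that the communication term of task `i` runs over the tasks `k > i`. The function names are hypothetical.

```python
def machine_load(j, x, ex, dx, cc):
    """Load on machine j: computation plus communication of the tasks placed on it."""
    n_tasks = len(x)
    load = 0.0
    for i in range(n_tasks):
        if x[i] != j:
            continue  # task i runs on another machine
        load += ex[i][j]  # computation cost of task i on machine j
        for k in range(i + 1, n_tasks):
            # data units from task i to task k times the per-unit link cost
            load += dx[i][k] * cc[x[i]][x[k]]
    return load


def max_load(x, ex, dx, cc):
    """Cost metric c1: load on the most heavily loaded machine."""
    return max(machine_load(j, x, ex, dx, cc) for j in range(len(cc)))
```

For example, with three tasks on two machines, `ex = [[2, 5], [4, 1], [3, 3]]`, `dx = [[0, 2, 0], [0, 0, 1], [0, 0, 0]]`, `cc = [[0, 1], [1, 0]]` and `x = [0, 1, 0]`, machine 0 carries a load of 2 + 2 + 3 = 7 and machine 1 a load of 1 + 1 = 2, so the metric evaluates to 7.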


Minimize the total execution time:

Let $c_p(n)$ represent the total computation time for a particular
assignment and $c_c(n)$ represent the total communication
time at a particular iteration. Hence:

$c_p(n) = \sum_{i=0}^{|S|-1} EX_{i x_i(n)}$

$c_c(n) = \sum_{i=0}^{|S|-1} \sum_{k=i+1}^{|S|-1} DX_{ik} * CC_{x_i(n) x_k(n)}$


Minimize the power consumed:

The cost metric for this formulation, $c_3(n)$, can be represented
as:

$c_3(n) = \sum_{i=0}^{|S|-1} PW_{i x_i(n)}$

The task assignment algorithm would now try to minimize
each of the $c_k(n)$ ($k = 1$ to $|C|$) metrics over $n$
iterations until the absolute minimum is reached.
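
The total execution time and power metrics can be sketched in the same way. The sketch below is illustrative rather than the authors' implementation: it assumes a list `x` of machine indices (one per task), nested lists `ex`, `dx`, `cc` and `pw` for the four matrices, and a communication sum taken over task pairs `(i, k)` with `k > i`. The function names are hypothetical.

```python
def total_execution_time(x, ex, dx, cc):
    """Cost metric c2: total computation time plus total communication time."""
    n_tasks = len(x)
    # c_p: each task contributes its execution cost on its assigned machine
    c_p = sum(ex[i][x[i]] for i in range(n_tasks))
    # c_c: data exchanged between each task pair times the link cost
    c_c = sum(
        dx[i][k] * cc[x[i]][x[k]]
        for i in range(n_tasks)
        for k in range(i + 1, n_tasks)
    )
    return c_p + c_c


def total_power(x, pw):
    """Cost metric c3: power consumed by every task on its assigned machine."""
    return sum(pw[i][x[i]] for i in range(len(x)))
```

Because both functions read only the assignment and the fixed matrices, they can be re-evaluated at every iteration without keeping any state.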

4.  Proposed  Learning  Automata  Model 


In the proposed framework, a learning automata model
is used to determine an optimal task assignment for the HC
system. This model is based on a variable structure learning
automaton explained in [1]. The section begins with an introduction
to this automaton model. The mathematical formulation
and advantages of using the model are discussed.
The construction of such an automata model for the purpose
of task assignment in the HC system forms the later
part of this section. The HC system model, which
is reflective of the actual HC system, serves as the external
environment for the automata model.

4.1.  Variable Structure  Learning  Automata 

The variable structure learning automaton is represented
by the quintuple $\{\phi, \alpha, \beta, A, G\}$, where

1. The state of the automaton at any iteration $n$, denoted by
$\phi(n)$, is an element of the finite set

$\phi = \{\phi_1, \phi_2, \ldots, \phi_s\}$.

2. The output of the automaton at iteration $n$, denoted
by $\alpha(n)$, is an element of the set

$\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_r\}$.

3. The input to the automaton at iteration $n$, denoted by
$\beta(n)$, is an element of the set

$\beta = \{\beta_1, \beta_2, \ldots, \beta_m\}$.

4. $A$ is the updating algorithm or the reinforcement
scheme.

5. The output function $G(\cdot)$ determines the output of the
automaton at any iteration $n$ in terms of the state at that
iteration:

$\alpha(n) = G[\phi(n)]$.

If each state of the automaton corresponds to a unique action
and the number of states is finite (as will be the case with
our HC system), then $r = s < \infty$ and hence $G$ is an identity
mapping. Therefore, the automaton can be represented as the
triple $\{\alpha, \beta, A\}$.

An action probability $p_j(n)$ is associated with every action
or state of the automaton at any iteration $n$. $p_j(n)$ represents
the probability that the automaton will be in state $j$ at
iteration $n$, amongst the $r$ possible states of the automaton.
The updating algorithm or reinforcement scheme,
$A$, updates the action probabilities at every iteration
$n$ and brings the automaton to an optimal solution state.
The precise manner in which the action $\alpha_i$ performed at an
iteration $n$ and the response $\beta(n)$ of the environment change
the probability $p_j(n)$ completely defines the reinforcement
scheme. If $p_j(n+1)$ is a linear function of $p_j(n)$, then
the reinforcement scheme is said to be linear; otherwise it is
termed non-linear.

Reinforcement Scheme

Consider a variable-structure automaton with $r$ actions
operating in a stationary environment with $\beta = \{0, 1\}$.
In other words, the automaton is operating in a P-model environment,
where the response is binary valued. Let $N$ be
the set of nonnegative integers and let $n \in N$. A general
scheme to update the action probabilities can be represented
as:

If $\alpha(n) = \alpha_i$ $(i = 1, 2, \ldots, r)$:

$p_j(n+1) = p_j(n) - g_j[p(n)]$ when $\beta(n) = 0$

$p_j(n+1) = p_j(n) + h_j[p(n)]$ when $\beta(n) = 1$

for all $j \ne i$.

Now, we have $\sum_{j=1}^{r} p_j(n) = 1$ in order to preserve the
probability measure. Therefore

$p_i(n+1) = p_i(n) + \sum_{j=1,\ j \ne i}^{r} g_j(p(n))$ when $\beta(n) = 0$

$p_i(n+1) = p_i(n) - \sum_{j=1,\ j \ne i}^{r} h_j(p(n))$ when $\beta(n) = 1$
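
As a short consistency check (a worked step added for illustration), the update for the chosen action $\alpha_i$ keeps the action probabilities summing to one. For a favorable response, $\beta(n) = 0$, every other action $j \ne i$ loses $g_j(p(n))$ and the chosen action gains the total amount lost, so summing over all actions gives:

```latex
\begin{aligned}
\sum_{j=1}^{r} p_j(n+1)
  &= \Big[\, p_i(n) + \sum_{j \ne i} g_j(p(n)) \Big]
   + \sum_{j \ne i} \big[\, p_j(n) - g_j(p(n)) \big] \\
  &= \sum_{j=1}^{r} p_j(n) = 1 .
\end{aligned}
```

The unfavorable case, $\beta(n) = 1$, follows in the same way with $g_j$ replaced by $-h_j$.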

Consider the variable structure automaton operating in
the S-model environment, where the responses can take
continuous values over an interval. By normalization, it can
be assumed that the interval is the unit interval $[0,1]$. Hence,
$\beta(n)$ for this model is a continuous variable where $\beta \in [0,1]$.
The equations for the S-model can be written as:

$p_i(n+1) = p_i(n) - (1 - \beta(n)) * g_i(p(n)) + \beta(n) * h_i(p(n))$
if $\alpha(n) \ne \alpha_i$

$p_i(n+1) = p_i(n) + (1 - \beta(n)) * \sum_{j \ne i} g_j(p(n)) - \beta(n) * \sum_{j \ne i} h_j(p(n))$
if $\alpha(n) = \alpha_i$
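
For illustration, consider an action $\alpha_j$ that was not chosen, whose S-model update is $p_j(n+1) = p_j(n) - (1 - \beta(n))\, g_j(p(n)) + \beta(n)\, h_j(p(n))$. Setting $\beta(n)$ to its two extreme values recovers the two P-model cases, and an intermediate value (0.25 is an arbitrary example) mixes them:

```latex
\begin{aligned}
\beta(n) = 0:    &\quad p_j(n+1) = p_j(n) - g_j(p(n)) \\
\beta(n) = 1:    &\quad p_j(n+1) = p_j(n) + h_j(p(n)) \\
\beta(n) = 0.25: &\quad p_j(n+1) = p_j(n) - 0.75\, g_j(p(n)) + 0.25\, h_j(p(n))
\end{aligned}
```

The S-model can therefore be read as an interpolation between a pure reward step and a pure penalty step.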

The following assumptions are made with regard to the
functions $g_j$ and $h_j$ $(j = 1, 2, \ldots, r)$.

Assumption 1: $g_j$ and $h_j$ are continuous functions.
Assumption 2: $g_j$ and $h_j$ are non-negative functions.
Assumption 3: $0 < g_j(p) < p_j$ and

$0 < \sum_{j=1,\ j \ne i}^{r} [p_j + h_j(p)] < 1$

for all $i = 1, 2, \ldots, r$ and all $p$ whose elements are all in the
open interval $(0,1)$.

Different reinforcement schemes are obtained by how
the functions $g(\cdot)$ and $h(\cdot)$ are characterized. For instance, if
the functions are linear in the action probabilities,
then the scheme is linear; otherwise it is non-linear.
It is also possible to have hybrid schemes. Essentially, the
nature of the reinforcement scheme affects the behavior of
the model.

4.2.  Construction  of  the  Model 

Figure 2 shows the schematic of the learning automata
model. The model is constructed by associating every task
$s_i$ in the TFG with a variable structure automaton. Each of
the automata is represented as a 3-tuple, $(\alpha^{s_i}, \beta^{s_i}, A^{s_i})$.
Since the tasks can be assigned to any of the $|M|$ machines,
the action sets of the automata are identical. It is assumed
that the environment is an S-model environment and hence
the domain of the input set to the automata is the interval
$[0, 1]$.

Therefore for any task $s_i$, $0 \le i < |S|$:

$\beta^{s_i}(n) \in [0,1]$

where $\beta^{s_i}(n)$ closer to 0 indicates that the action taken by the
automaton of task $s_i$ was favorable to the system, and closer
to 1 indicates an unfavorable response.


Figure 2. Learning Automata Model (diagram label: Overall response, $OR_k$)


The output response of the HC system model corresponding
to each of the cost metrics in the set $C$ is assumed
to be binary. In other words, if the value of a particular cost
metric $c_k \in C$ at iteration $n$ is more optimal than its value
at iteration $n - 1$, then the response is favorable; if not, the
response is unfavorable.

If $c_k(n)$ is more optimal than $c_k(n-1)$
then $OR_k = 0$, else $OR_k = 1$, where $1 \le k \le |C|$

where $OR_k$ is the overall output response of the system with
respect to $c_k(n)$.
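
A minimal sketch of this binary response is given below. It assumes that every cost metric is to be minimized, so that "more optimal" means a strictly lower value than at the previous iteration; the function name and the treatment of ties as unfavorable are assumptions of the sketch, not statements from the paper.

```python
def overall_responses(current_costs, previous_costs):
    """Return OR_k for each metric: 0 (favorable) if the cost dropped, else 1."""
    return [
        0 if current < previous else 1
        for current, previous in zip(current_costs, previous_costs)
    ]
```

For instance, `overall_responses([985, 20], [1039, 17])` yields `[0, 1]`: the first metric improved and the second did not.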

The next step in developing the model is to translate
$OR_k$ into the inputs $\beta^{s_i}(n)$ for each of the automata. This
involves two steps. First, $OR_k$ would have to be translated
into the input to each of the $s_i$ automata. Let $\mu_i^k$ represent
this value. Hence, $\mu_i^k(n)$ corresponds to the input to the
automaton $(\alpha^{s_i}, \beta^{s_i}, A^{s_i})$ with respect to cost metric $c_k$ at
iteration $n$. Assume that $\lambda_k$ indicates the weight associated
with the cost metric $c_k(n)$. Now the second step of this
two-step process can be represented as:

$\beta^{s_i}(n) = \sum_{k=1}^{|C|} \lambda_k * \mu_i^k(n)$ for $i = 0$ to $|S| - 1$

where

$\sum_{k=1}^{|C|} \lambda_k = 1$ and $\lambda_k \ge 0$ for $k = 1$ to $|C|$
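
The second step, the weighted combination of the per-metric inputs into a single automaton input, can be sketched as below. The function name and the explicit validity check on the weights are illustrative assumptions; `mu[k]` stands for the per-metric input of one task and `weights[k]` for the weight of metric `k`.

```python
def automaton_input(mu, weights, tol=1e-9):
    """Combine per-metric inputs mu[k] (0 or 1) into one input in [0, 1]."""
    if len(mu) != len(weights):
        raise ValueError("one weight is needed per cost metric")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must be non-negative and sum to one")
    return sum(w * m for w, m in zip(weights, mu))
```

With two metrics weighted 0.75 and 0.25, a favorable response on the first and an unfavorable one on the second gives an input of 0.25, that is, a mostly favorable signal.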

There are different ways of accomplishing the first step.
The simplest way would be to make $OR_k$ the
input to each of the automata in the model.

H1: $\mu_i^k(n) = OR_k$ for all $s_i \in S$ and $c_k \in C$

H1 is simple to implement, but it completely ignores the
structure of the TFG while guiding the learning of the automata
in the model. To put it differently, using H1 it is
possible that an unfavorable output response from the system
may be falsely interpreted as an unfavorable input to a
particular automaton.

Ideally,  the  model  should  generate  an  assignment  that 
results  in  a  consistent  favorable  output  response  from  the 
system.  It  is  difficult  to  achieve  this  objective  based  on  the 
structure  of  the  TFG.  To  overcome  this  problem,  the  authors 
propose  techniques  based  on  the  history  of  the  actions  gen¬ 
erated.  With  every  action  of  an  automaton,  two  additional 
fields  are  associated.  The  first  one  indicates  the  number  of 
favorable  responses,  and  the  second  field  indicates  the  num¬ 
ber  of  unfavorable  responses,  corresponding  to  a  particular 
action.  Consequently: 

FAV -> Matrix of order $|S| \times |M| \times |C|$, the elements
fav[i, j, k] of which indicate the number of times a favorable
response resulted when task $s_i$ was assigned to machine
$m_j$ for cost metric $c_k$.

UNFAV -> Matrix of order $|S| \times |M| \times |C|$, the elements
unfav[i, j, k] of which indicate the number of times
an unfavorable response resulted when task $s_i$ was assigned
to machine $m_j$ for cost metric $c_k$.
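
The bookkeeping for these two history matrices might look like the sketch below. The paper does not spell out when the counters are updated, so the sketch assumes that after every iteration each task's chosen machine is credited once per cost metric according to that metric's binary response (0 favorable, 1 unfavorable). The function names are hypothetical.

```python
def new_history(n_tasks, n_machines, n_metrics):
    """Create an |S| x |M| x |C| table of zero counters."""
    return [
        [[0] * n_metrics for _ in range(n_machines)]
        for _ in range(n_tasks)
    ]


def record_responses(fav, unfav, x, responses):
    """Credit the chosen machine x[i] of every task i for each metric k."""
    for i, j in enumerate(x):
        for k, response in enumerate(responses):
            if response == 0:
                fav[i][j][k] += 1    # favorable response for metric k
            else:
                unfav[i][j][k] += 1  # unfavorable response for metric k
```

Usage: create both tables once with `new_history`, then call `record_responses(fav, unfav, x, responses)` at the end of each iteration.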

Based on this history information, different heuristic
techniques can be developed that will help guide the automata
to the optimal assignment. Five general heuristic
techniques that the authors experimented with are presented
here. Knowledge about the nature of an application can be
exploited to develop heuristics specific to that application.
But in this work, no assumptions are made with respect
to the applications and consequently the heuristics can be
used in any environment. The results of these techniques
are compared with each other, and with the technique H1
that is not dependent on the structure of the TFG.

H2: If $\mathrm{fav}[i, x_i(n), k] > \mathrm{unfav}[i, x_i(n), k]$
then $\mu_i^k(n) = 0$, else $\mu_i^k(n) = 1$,
for all $s_i \in S$ and $c_k \in C$

Since the objective is to maximize the number of favorable
responses and minimize the unfavorable ones, this
heuristic treats the input to the automaton as favorable if,
for that particular action, the number of favorable responses
is greater than the number of unfavorable ones.

H3: If $(\mathrm{fav}[i, x_i(n), k] - \mathrm{unfav}[i, x_i(n), k])$ is the maximum
for all actions of $s_i \in S$ and $c_k \in C$,
then $\mu_i^k(n) = 0$, else $\mu_i^k(n) = 1$

Here the difference between favorable and unfavorable responses
is compared against that of all other actions. If the chosen
action gives the maximum value, the automaton input is
termed favorable.


H4: If $(\mathrm{fav}[i, x_i(n), k] - \min_{0 \le p < |S|,\ 0 \le q < |M|} \mathrm{fav}[p, q, k] + \max_{0 \le p < |S|,\ 0 \le q < |M|} \mathrm{unfav}[p, q, k] - \mathrm{unfav}[i, x_i(n), k])$
$> (\max_{0 \le p < |S|,\ 0 \le q < |M|} \mathrm{fav}[p, q, k] - \mathrm{fav}[i, x_i(n), k] + \mathrm{unfav}[i, x_i(n), k] - \min_{0 \le p < |S|,\ 0 \le q < |M|} \mathrm{unfav}[p, q, k])$
then $\mu_i^k(n) = 0$, else $\mu_i^k(n) = 1$,
for all $s_i \in S$ and $c_k \in C$

In this heuristic, if the chosen action has a high value
for favorable responses and a low value for unfavorable responses,
the input is favorable.


H5: If $\mathrm{fav}[i, x_i(n), k] = \max_{0 \le p < |S|,\ 0 \le q < |M|} \mathrm{fav}[p, q, k]$,
then $\mu_i^k(n) = 0$, else $\mu_i^k(n) = 1$,
for all $s_i \in S$ and $c_k \in C$

In H5, if the chosen action has the maximum number of favorable
responses, the automaton input is favorable; otherwise it is
unfavorable.

H6: If $\mathrm{unfav}[i, x_i(n), k] = \min_{0 \le p < |S|,\ 0 \le q < |M|} \mathrm{unfav}[p, q, k]$,
then $\mu_i^k(n) = 0$, else $\mu_i^k(n) = 1$,
for all $s_i \in S$ and $c_k \in C$

Similarly, in H6, if the chosen action has the fewest unfavorable
responses, the input is favorable.
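
The five history-based heuristics can be sketched as small Python functions. This is an interpretation under stated assumptions, not the authors' code: `fav` and `unfav` are nested lists indexed as `[task][machine][metric]`, `xi` is the machine chosen for task `i` at the current iteration, and each function returns 0 for a favorable input and 1 for an unfavorable one. In `h3` the comparison is assumed to run over the machines available to the same task, and in `h4` to `h6` the minimum and maximum are assumed to run over all task and machine pairs. The function names are hypothetical.

```python
def _entries(table, k):
    """All counters table[p][q][k] over every task p and machine q."""
    return [machine[k] for task in table for machine in task]


def h2(fav, unfav, i, xi, k):
    return 0 if fav[i][xi][k] > unfav[i][xi][k] else 1


def h3(fav, unfav, i, xi, k):
    # favorable minus unfavorable count, per machine, for task i
    diffs = [fav[i][q][k] - unfav[i][q][k] for q in range(len(fav[i]))]
    return 0 if diffs[xi] == max(diffs) else 1


def h4(fav, unfav, i, xi, k):
    f, u = fav[i][xi][k], unfav[i][xi][k]
    all_f, all_u = _entries(fav, k), _entries(unfav, k)
    distance_from_worst = (f - min(all_f)) + (max(all_u) - u)
    distance_from_best = (max(all_f) - f) + (u - min(all_u))
    return 0 if distance_from_worst > distance_from_best else 1


def h5(fav, unfav, i, xi, k):
    return 0 if fav[i][xi][k] == max(_entries(fav, k)) else 1


def h6(fav, unfav, i, xi, k):
    return 0 if unfav[i][xi][k] == min(_entries(unfav, k)) else 1
```

All five functions share one signature, so a scheduler sketch can select a heuristic by passing the function itself as a parameter.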

In order to complete the model, action probabilities need
to be assigned for the actions of the automata. The action set
of an automaton is the set of machines $M$, as discussed
earlier. Hence, the action probability would be the probability
of assigning a task $s_i$ to one of the machines. Let this
probability be $p_{ij}(n)$. Therefore:

$p_{ij}(n)$ -> the probability of assigning task $s_i$ to machine
$m_j$.

To preserve the probability measure, the sum of all the
action probabilities for a particular task should equal one.
Therefore:

$\sum_{j=0}^{|M|-1} p_{ij}(n) = 1$ for any $s_i \in S$

Reinforcement  Scheme 

The general reinforcement scheme for this model can now
be formulated in terms of the action probabilities. Following
the construction of the scheme from [1], consider the
automaton for a task $s_i$:

$p_{ij}(n+1) = p_{ij}(n) - (1 - \beta^{s_i}(n)) * g(p_{ij}(n)) + \beta^{s_i}(n) * h(p_{ij}(n))$
for all $j \ne x_i(n)$

$p_{i x_i(n)}(n+1) = p_{i x_i(n)}(n) + (1 - \beta^{s_i}(n)) * \sum_{j \ne x_i(n)} g(p_{ij}(n)) - \beta^{s_i}(n) * \sum_{j \ne x_i(n)} h(p_{ij}(n))$

The nature of the functions $g(\cdot)$ and $h(\cdot)$ determines the
nature of the reinforcement scheme and hence the performance
of the scheduling system. These functions could be
linear, non-linear or hybrid. For our study here, it is assumed
that the functions are linear and are defined as:

$g(p_{ij}(n)) = a * p_{ij}(n)$

and

$h(p_{ij}(n)) = b/(|M| - 1) - b * p_{ij}(n)$

Parameters  a  and  b  are  known  as  reward  and  penalty  pa¬ 
rameters  respectively,  and  help  guide  the  automaton  to  the 
optimal  solution.  The  choice  of  these  parameters  affects  the 
behavior  of  each  of  the  proposed  techniques.  The  next  sec¬ 
tion  details  the  performance  analysis  of  these  techniques. 
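
One update step of this linear scheme for a single task automaton can be sketched as follows. The sketch assumes the linear forms g(p) = a * p and h(p) = b / (|M| - 1) - b * p, an input `beta` in [0, 1] with 0 meaning favorable, and that the chosen machine absorbs whatever probability the other machines give up so that the row still sums to one. The function name is hypothetical.

```python
def update_task_probabilities(p, chosen, beta, a, b):
    """Return the updated action probabilities of one task automaton.

    p      -- current probabilities over the machines (sums to one)
    chosen -- index of the machine the task was assigned to
    beta   -- automaton input in [0, 1]; 0 is favorable, 1 is unfavorable
    a, b   -- reward and penalty parameters
    """
    n_machines = len(p)
    if n_machines < 2:
        raise ValueError("at least two machines are needed")
    new_p = list(p)
    moved = 0.0  # probability mass transferred to the chosen machine
    for j in range(n_machines):
        if j == chosen:
            continue
        g = a * p[j]
        h = b / (n_machines - 1) - b * p[j]
        delta = (1.0 - beta) * g - beta * h
        new_p[j] = p[j] - delta
        moved += delta
    new_p[chosen] = p[chosen] + moved
    return new_p
```

Starting from four equally likely machines with a = b = 0.1, a fully favorable input raises the chosen machine from 0.25 to 0.325, while a fully unfavorable input lowers it to 0.225.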

5.  Results  and  Discussion 

In  this  section,  the  different  heuristics  that  are  proposed 
are  analysed  with  respect  to  specific  cost  metrics.  It  begins 
with  an  introduction  to  the  simulation  environment  and  sub¬ 
sequently  the  results  are  presented  and  analysed. 

The performance of the different techniques was evaluated
by using randomly generated task graphs and processor
graphs. The execution cost, communication time, data
exchange and power cost matrices were randomly generated
over some predefined ranges with uniform probability.
A specific set of cost metrics was then chosen and the
performance evaluated for varying values of the reward and
penalty parameters, $a$ and $b$. The automata model was iterated
until the probability of a chosen action in each automaton
exceeded 0.99, or the number of iterations reached
a maximum limit. If the former condition stopped the automata,
then the model was said to converge. The latter condition
meant the model was non-convergent. Table 1 shows
the predefined ranges used to generate the random graphs.
The efficiency of an algorithm is determined by the quality
of the solution generated and the speed of the algorithm.


Parameter                          Value
Number of tasks                    10
Number of machines                 7
Number of edges                    5, 10, 15, 20 and 25
Execution matrix data range        1000
Communication matrix data range    4
Data exchange matrix data range    500
Power matrix data range            25

Table 1. Parameters for TFG and PG
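
A random test instance with the sizes in Table 1 could be generated roughly as sketched below. Several details are assumptions of this sketch, since the text gives only a single "data range" value per matrix: each range is read as the integers from 1 up to the stated value, the communication cost matrix is taken to be symmetric with a zero diagonal, and edges are drawn only from task pairs (i, k) with i < k so that the task graph stays acyclic. The function and parameter names are hypothetical.

```python
import random


def random_instance(n_tasks=10, n_machines=7, n_edges=5, seed=None,
                    ex_max=1000, cc_max=4, dx_max=500, pw_max=25):
    """Return (ex, cc, dx, pw) drawn uniformly from the assumed ranges."""
    rng = random.Random(seed)
    ex = [[rng.randint(1, ex_max) for _ in range(n_machines)]
          for _ in range(n_tasks)]
    pw = [[rng.randint(1, pw_max) for _ in range(n_machines)]
          for _ in range(n_tasks)]
    # symmetric per-unit communication cost, zero within a machine
    cc = [[0] * n_machines for _ in range(n_machines)]
    for j in range(n_machines):
        for l in range(j + 1, n_machines):
            cc[j][l] = cc[l][j] = rng.randint(1, cc_max)
    # pick n_edges distinct forward pairs so the task graph is acyclic
    forward_pairs = [(i, k) for i in range(n_tasks)
                     for k in range(i + 1, n_tasks)]
    dx = [[0] * n_tasks for _ in range(n_tasks)]
    for i, k in rng.sample(forward_pairs, n_edges):
        dx[i][k] = rng.randint(1, dx_max)
    return ex, cc, dx, pw
```

Passing a fixed `seed` makes a run repeatable, which helps when comparing heuristics on the same graphs.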


Therefore,  in  order  to  study  the  performance  of  our  model, 
the  number  of  iterations  required  for  convergence  and  the 
solution  cost  generated  are  studied  as  a  function  of  the  com¬ 
munication  complexity. 
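
The stopping rule used in these experiments can be sketched as a simple driver loop. In the sketch, `P[i][j]` is the probability of assigning task `i` to machine `j`, and `step` is a hypothetical callback that performs one learning update and returns the new probabilities; both names, like the function names, are illustrative. The rule itself follows the text: stop when every automaton has an action with probability above 0.99, or when the iteration limit is reached.

```python
import random


def sample_assignment(P, rng):
    """Draw one machine per task according to its action probabilities."""
    return [rng.choices(range(len(row)), weights=row)[0] for row in P]


def has_converged(P, threshold=0.99):
    """True when every task has one action with probability above threshold."""
    return all(max(row) > threshold for row in P)


def run(P, step, max_iterations=2000, seed=0):
    """Iterate until convergence or the limit; return (P, iterations, converged)."""
    rng = random.Random(seed)
    for n in range(1, max_iterations + 1):
        x = sample_assignment(P, rng)
        P = step(P, x)  # one learning update supplied by the caller
        if has_converged(P):
            return P, n, True
    return P, max_iterations, False
```

A run that returns `False` in the last position corresponds to what the text calls a non-convergent technique.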


Figure  3.  #  of  iterations  vs  Edges:  a  =  b  =  0.1 


Initially, to verify the working of the model, a single
cost metric $c_1$ (refer to section 3) was chosen. Two sets of studies
were performed. For the first one, the reward and penalty
parameters $a$ and $b$ were made equal, $a = b = 0.1$, and
the maximum limit for the number of iterations was set at
2000. For the second set, $a = 0.1$ and $b = 0.009$, and the
limit for the number of iterations was set at 10000. Figures
3 and 4 show the results for the first set of data. From the
graphs, it can be concluded that when $a = b$, techniques
H1, H4 and H6 are non-convergent and heuristic H2 provides
the best results for all communication complexities.
Figures 5 and 6 show the results for the second set of experiments.
Here, techniques H4 and H6 continue to remain
non-convergent, but H1 converges, though it requires a considerably
higher number of iterations. The cost graph indicates
that H1 generates better results than H2.

The first set of experimental results validates the working
of the model. The best solutions generated by the proposed
techniques were close to the optimal solution. The results
indicate that the choice of the reward and penalty parameters
affects the solution cost and the choice of the heuristics.

Figure 4. Cost vs Edges: a = b = 0.1


Figure 5. # of iterations vs Edges: a = 0.1, b = 0.009


In order to study the effectiveness of the model for multiple
cost metric optimization, two cost metrics were chosen,
$c_1$ and $c_3$ (refer to section 3). Experiments were performed by
varying the weights associated with the metrics, $\lambda_{c_1}$ and $\lambda_{c_3}$,
such that $\lambda_{c_1} + \lambda_{c_3} = 1$. The maximum limit for the number
of iterations was set at 2000. The values for the reward
and penalty parameters were chosen to get the best results.
The results that were obtained have been tabulated in Tables
2, 3, 4 and 5. Techniques H4 and H6 were once again
non-convergent. The effect of the weights associated with the cost
metrics can clearly be seen in all the tables. The first observation
that can be made from the tables is that, as the weight
associated with a particular cost metric is decreased, the optimality
of its solution also decreases. Essentially, therefore,
the automata model achieves independent optimization of
each cost metric subject to its weight. This differs from
other works, in which the weighted sum of the metrics is optimized
rather than each metric independently. With regard
to the heuristic techniques, it can be seen that technique H1
produces the best results. For each of the graphs that were
generated, the optimal cost of metric $c_1$ was 985 and that of
$c_3$ was 17. From the tables it can be seen clearly that the solution
generated by H1 is close to the optimal. Technique H2
produced the next best results. Both H1 and H2 converge
in almost the same number of iterations. This is in sharp
contrast to techniques H3 and H5, which converge very fast.
But the solutions generated by these techniques are far
from the optimal. This is very pronounced as the communication
complexity increases. H1, on the other hand,
continued to produce optimal solutions even with increased
communication complexity.


Figure 6. Cost vs Edges: a = 0.1, b = 0.009 (x-axis: # of edges)


6.  Conclusions  and  Future  Work 

In  conclusion,  this  work  presented  a  framework  for 
task  assignment  in  heterogeneous  computing  systems.  The 
framework  was  based  on  a  learning  automaton  model  and 
optimizes  multiple  cost  metrics.  The  proposed  model  can 
be  used  for  dynamic  task  assignment  and  scheduling.  The 
model  can  also  adapt  itself  to  changing  hardware  environ¬ 
ments.  The  cost  metrics  could  be  application  specific  or 

# of edges   λc1    λc3    c1      c3    Iterations
5            1.00   0.00   985     51    420
             0.75   0.25   985     30    529
             0.50   0.50   1039    17    504
             0.25   0.75   1125    17    476
             0.00   1.00   1168    17    366
10           1.00   0.00   985     48    430
             0.75   0.25   985     34    460
             0.50   0.50   1103    17    549
             0.25   0.75   1192    17    383
             0.00   1.00   1265    17    342
15           1.00   0.00   985     53    453
             0.75   0.25   985     24    960
             0.50   0.50   1154    18    564
             0.25   0.75   1342    17    516
             0.00   1.00   1431    17    266
20           1.00   0.00   985     62    327
             0.75   0.25   985     27    1021
             0.50   0.50   1561    19    476
             0.25   0.75   1776    17    572
             0.00   1.00   1776    17    246
25           1.00   0.00   985     55    907
             0.75   0.25   1119    38    1390
             0.50   0.50   1460    20    460
             0.25   0.75   1911    17    456
             0.00   1.00   1776    17    246

Table 2. Cost values for Technique H1

# of edges   λc1    λc3    c1      c3    Iterations
5            1.00   0.00   1185    57    255
             0.75   0.25   1168    33    295
             0.50   0.50   1263    23    427
             0.25   0.75   1165    23    388
             0.00   1.00   1201    19    457
10           1.00   0.00   1156    57    317
             0.75   0.25   1168    37    341
             0.50   0.50   1192    29    266
             0.25   0.75   1265    19    340
             0.00   1.00   1215    19    457
15           1.00   0.00   1137    54    304
             0.75   0.25   1357    39    292
             0.50   0.50   1392    19    373
             0.25   0.75   1342    28    241
             0.00   1.00   1489    19    369
20           1.00   0.00   1692    59    306
             0.75   0.25   1634    58    389
             0.50   0.50   1561    30    336
             0.25   0.75   2061    19    226
             0.00   1.00   1909    19    443
25           1.00   0.00   1647    67    304
             0.75   0.25   1685    58    354
             0.50   0.50   1718    24    280
             0.25   0.75   1776    17    460
             0.00   1.00   1909    19    443

Table 3. Cost values for Technique H2


a  general  metric.  The  model  was  constructed  by  associat¬ 
ing  every  task  in  the  TFG  with  a  learning  automaton.  The 
various  automata  work  in  unison  to  produce  the  optimal  as¬ 
signment  of  tasks.  The  paper  presented  various  techniques 
to  guide  the  learning  of  the  automata.  These  techniques  are 
not  specific  to  any  application  and  are  applicable  to  any  en¬ 
vironment.  The  performance  analysis  of  these  techniques 
revealed that technique H1 results in the best solution
both when a single cost metric and when multiple cost metrics
are considered. It can therefore be concluded that for
the  proposed  framework,  the  structure  of  the  TFG  does  not 
adversely  affect  the  task  assignment  procedure.  However, 
techniques  specific  to  a  class  of  applications  can  be  devel¬ 
oped  that  will  produce  better  results.  Techniques  H3  and  H5 
provide  fast  results  but  they  are  sub-optimal.  As  future 
work,  it  would  be  interesting  to  study  how  these  techniques 
perform  for  application  specific  cost  metrics.  Also,  a  thor¬ 
ough  analysis  of  the  effect  of  the  reward  and  penalty  param¬ 
eters  in  guiding  the  model  to  the  optimal  solution  would  be 
essential. 

# of edges   λc1    λc3    c1      c3     Iterations
5            1.00   0.00   1541    61     81
             0.75   0.25   1119    79     85
             0.50   0.50   1257    68     86
             0.25   0.75   1489    34     72
             0.00   1.00   2446    32     81
10           1.00   0.00   1909    79     [missing]
             0.75   0.25   1851    75     84
             0.50   0.50   1489    44     89
             0.25   0.75   2150    46     77
             0.00   1.00   2446    32     79
15           1.00   0.00   1793    95     80
             0.75   0.25   1975    105    80
             0.50   0.50   2075    59     75
             0.25   0.75   2470    44     79
             0.00   1.00   2472    32     78
20           1.00   0.00   3287    121    77
             0.75   0.25   2044    94     82
             0.50   0.50   2689    49     80
             0.25   0.75   3521    25     90
             0.00   1.00   2847    32     78
25           1.00   0.00   3287    121    77
             0.75   0.25   3001    184    81
             0.50   0.50   2689    42     76
             0.25   0.75   2061    46     86
             0.00   1.00   2847    32     79

Table 4. Cost values for Technique H3

# of edges   λc1    λc3    c1      c3     Iterations
5            1.00   0.00   1431    012    84
             0.75   0.25   1591    88     80
             0.50   0.50   1621    85     78
             0.25   0.75   1591    50     81
             0.00   1.00   1591    50     81
10           1.00   0.00   1621    97     75
             0.75   0.25   1591    88     85
             0.50   0.50   1633    85     78
             0.25   0.75   2280    50     81
             0.00   1.00   2280    50     81
15           1.00   0.00   1985    97     76
             0.75   0.25   1665    97     78
             0.50   0.50   2266    72     75
             0.25   0.75   2334    50     81
             0.00   1.00   2334    50     81
20           1.00   0.00   1993    92     80
             0.75   0.25   2949    83     79
             0.50   0.50   2266    85     79
             0.25   0.75   2255    56     76
             0.00   1.00   3948    50     81
25           1.00   0.00   2949    102    81
             0.75   0.25   2949    96     77
             0.50   0.50   2266    85     79
             0.25   0.75   2255    56     77
             0.00   1.00   3948    50     81

Table 5. Cost values for Technique H5
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Abstract 

Scheduling meta applications on a computational grid uses estimation
of the execution times of component programs to compute optimal
schedules. In a realistic case various factors (hazards) lead to
estimation errors, which affect both the performance of a schedule
and resource utilization. We introduce the concept of robustness and
present an analysis technique to determine the robustness of a
schedule. We develop methods for reducing the chance that a
metaprogram exceeds its execution time due to components outside its
critical path. The results of this analysis are used to compute
schedules less sensitive to hazards. This translates into more
accurate reservation requirements for critical systems, and reduced
expected execution time for non-critical metaprograms executed
repeatedly. Simulation results prove the efficiency and applicability
of our algorithms.


1.  Introduction 

Informally, a metaprogram is a collection of programs which cooperate
towards a common goal. A computational grid is an abstraction for a
collection of autonomous and heterogeneous computers interconnected
by a high speed network. A metaprogram is executed on a computational
grid. Throughout this paper we assume that there is no single
scheduler which controls all resources of the system and that each
local scheduler accepts reservations. We also assume that a meta
scheduling agent [3] has information about the execution time of each
component of the metaprogram and is capable of computing schedules. A
schedule associates a node of the computing grid and a startup time
with each component of the metaprogram. For an overview of high
performance schedulers for a grid of autonomous computers and a
comprehensive bibliography on the subject we refer the reader to [1].


There are two major challenges in metaprogram scheduling:

(a) The scheduling problem is NP-complete, therefore finding an
optimal solution may be impossible or impractical except for trivial
metaprograms and small grids. To overcome the explosion of the search
space for realistic problems, one can apply approximation algorithms
(heuristic-guided search), or genetic algorithms [5].

(b) The nondeterministic nature of the program execution time renders
even an optimal solution approximate, or infeasible, depending upon
the resource allocation model. Several solutions to accommodate the
nondeterministic execution time of the components of a metaprogram
are possible:

(b1) Grossly overestimate the execution time of each program and
minimize the risk of exceeding the allotted use of each host at the
expense of the utilization of the grid.

(b2) Use dynamic algorithms which compute schedules at execution
time. Once the data flow allows a program to be scheduled, gather
information about the state of the grid and compute a new schedule
for the remaining components of the metaprogram.

(b3) Use static scheduling algorithms to compute schedules for
various scenarios, and at run time adopt the schedule which best fits
the current conditions [5]. If the grid is shared by multiple
metaprograms this may not be feasible.

(b4) Use static algorithms for finding schedules less vulnerable to
hazards, i.e. more robust.

In this article we explore the last alternative and observe that it
can be used in conjunction with any other approach for accommodating
the nondeterministic program execution times.

We  now  introduce  a  formalism  for  metaprogram 
scheduling  and  the  notations  used  throughout  this  paper: 

A, B, C - the components of a metaprogram
$H_j$ - the hosts of the grid
$S_i$ - the schedules
$RunTime(N_i, H_j)$ - the execution time of the component $N_i$ on $H_j$
$P_i$ - a path of the metaprogram
$\mathcal{P}$ - the set of paths of the metaprogram
$\mathcal{P}_c$ - the subset of the potentially critical paths
$\mathcal{P}_n$ - the subset of non-critical paths
$\mathcal{N}$ - the set of all components of the metaprogram
$\mathcal{N}_c$ - the subset of the potentially critical components
$\mathcal{N}_n$ - the subset of non-critical components
$t_i$ - the estimated execution time of component i in schedule S
$\tilde{t}_i$ - the actual execution time of component i in schedule S
$t_{s_i}$ - the starting time of component i in schedule S
$t_{f_i}$ - the completion/end time of component i in schedule S
$t_{max_i}$ - the upper bound of the execution time of component i
$t_{spare}(A \to B)$ - the spare time of the link from component A to B
$\sigma_i$ - the slack of component $N_i$
$\bar{\sigma}_i$ - the adjusted slack of component $N_i$

Given a directed acyclic graph $V(N, E)$, the nodes
$N = \{N_1, N_2, \ldots, N_n\}$ of this graph are called components
and the edges data paths. The acyclic graph is called a metaprogram.
An ordering of the nodes is a permutation $P$ of the nodes
$\{N_{P_1}, N_{P_2}, \ldots, N_{P_n}\}$ which preserves the order of
the nodes in the graph, i.e. if there is an edge $N_i N_j$ in the
graph, then $P_i < P_j$.

Given a grid $G = \{H_1, H_2, \ldots, H_k\}$ consisting of k hosts,
and a metaprogram $V = (N, E)$, a mapping of the component $N_i$ to
the grid is a function associating a unique host in $G$ to the
program, $Map(N_i) = H_j$. The mapping of the metaprogram to the grid
is a set:

$Map(V, G) = \{Map(N_i)\} \quad \forall N_i \in N$    (1)

We associate with each pair $(N_i, H_j)$ with $N_i \in N$ and
$H_j \in G$ a scalar called the running time of program $N_i$ on host
$H_j$, $t_i = RunTime(N_i, H_j)$. The running time of the metaprogram
components on the grid is the set:
$RunTime(V, G) = \{RunTime(N_i, H_j)\} \quad \forall N_i \in N, H_j \in G$.

Given a metaprogram $V$, a grid $G$, an ordering of the programs,
$P$, and a mapping $Map(V, G)$ we associate to each program a scalar
value called the startup time $t_{s_i} = Start(N_i)$. The startup
time of the metaprogram components is the set
$Start(V) = \{Start(N_i)\} \quad \forall N_i \in N$. The
completion/end time of a component is defined as
$End(N_i) = t_{f_i} = t_{s_i} + t_i$ and the set
$End(V) = \{End(N_i)\} \quad \forall N_i \in N$ is called the
completion time of the metaprogram components. The goal component of
a metaprogram is the component whose result is the output of the
entire computation; it is the last executed component, and its
completion time $t_f$ coincides with the completion time of the
metaprogram.

The following restriction applies: the execution of any two
components $N_i$ and $N_j$ of a metaprogram mapped onto the same
host, $Map(N_i) = Map(N_j)$, cannot overlap:

$P_i < P_j \Rightarrow t_{f_i} \le t_{s_j}$    (2)

Given a metaprogram $V(N, E)$, the grid $G$, and the set
$RunTime(V, G)$, a schedule of the metaprogram on the grid is the
triplet consisting of the ordering, mapping and starting times of all
the components of the metaprogram, $(P(N), Map(V, G), Start(V))$. The
total running time of a schedule $S$ on grid $G$ is
$T(V, G, S) = \max_i(t_{f_i}) \quad \forall N_i \in N$.

The deterministic optimal scheduling problem: given a metaprogram $V$
and the set of the running times of its components on a grid,
$RunTime(V, G)$, find the schedule which minimizes the total running
time, $T(V, G, S)$. The deterministic nature of a schedule is due to
the fact that the values in the set $RunTime(V, G)$ are
deterministic.

The nondeterministic metaprogram scheduling problem is a variant of
the deterministic metaprogram scheduling problem in which we assume
that the execution times in the set $T(V, G, *)$ are random variables
with known distributions and we try to find the schedule which
minimizes the total running time. For a static scheduling approach,
we can only hope to minimize the mean execution time over a number of
runs.

In practice, the distribution of the execution time of a program is
difficult if not impossible to obtain. Analytical expressions can
only be obtained for some special cases unlikely to be of interest. A
sample distribution requires empirical knowledge about the program
and the execution history of the program.

Our model does not account for data migration delays because we are
primarily concerned with coarse-grain distributed computing on a
high-speed, low-latency network. The analysis may, however, be
extended to models where data migration has a significant impact on
the execution times.

In the next section we introduce the concept of robustness and
describe a technique to determine the tolerance of a schedule to the
hazards. An $O(n^2)$ time algorithm is given.

2.  The  robustness  of  a  schedule 

An example illustrates the concept of robustness of a schedule. In
Figure 1 we present a metaprogram together with the estimated
execution times of its components. These estimates refer to a
reference computer, and should be adjusted to take into account the
performance of the computer on which the execution takes place. The
communication delays are ignored. Suppose that we want to schedule
the program on a grid consisting of H1, H2 and H3. The reference
computer is H1, while the speed of H2 is 50% and of H3 25% of the
reference.


[Figure 1. Panel for schedule S2:]

Component  Estimate  Host  Exec time  Finished at  Spare time
A          10 s      H1    10 s       t = 10 s     none
B          80 s      H1    80 s       t = 90 s     none
C          20 s      H2    40 s       t = 50 s     40 s
D          10 s      H3    40 s       t = 50 s     40 s
E          30 s      H1    30 s       110 s        none

Figure 1. Two schedules of the same metaprogram with the same
execution time but different robustness. The metaprograms are
executed on the hosts H1, H2 and H3 with the relative speed 1.0, 0.5
and 0.25 respectively.


Consider two optimal schedules with the same execution time of
t = 120 s for the given grid (we leave the proof of the optimality to
the reader). A closer examination of the two schedules reveals an
important difference between them. For $S_1$, component D finishes at
time t = 30 s, while its output is needed at time t = 90 s, and the
host on which it is scheduled (H2) is not used again in this
metaprogram. In this case we say that the component D has 60 seconds
of spare time. Even if the execution of D takes twice the estimated
value, the total execution time of the metaprogram will not be
affected. Unfortunately, none of the other components in this
schedule have spare time; we say they are critical. For $S_2$ both C
and D have 40 seconds of spare time.

Intuitively, it is obvious that the schedule on the right, $S_2$, is
"better" than $S_1$, the schedule on the left of Figure 1. To provide
a quantitative assessment of the difference between the two
schedules, assume a probability, say p = 0.2, that a component is
late and that the execution times are independent random variables.
The last assumption may not be true in practical systems; however, it
is a good approximation in cases when the delays are caused by
discrete events independent of the distributed application.

The probabilities that the metaprogram is late for the two schedules
are

$P_{late}(S_1) = 1 - (1 - p)^4 = 0.5904$
$P_{late}(S_2) = 1 - (1 - p)^3 = 0.4880$
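
The two values can be checked with a short calculation. The sketch below is only an illustration: the function name `p_late` is ours, and the counts of 4 and 3 critical components are inferred from the stated results 0.5904 and 0.4880 under the assumption that the metaprogram is on time only if every critical component is on time.

```python
def p_late(num_critical: int, p: float) -> float:
    """Probability that at least one critical component is late, assuming
    independent components that are each late with probability p."""
    return 1.0 - (1.0 - p) ** num_critical


# Assumed counts: 4 critical components in S1, 3 in S2, with p = 0.2.
late_s1 = p_late(4, 0.2)
late_s2 = p_late(3, 0.2)
print(round(late_s1, 4), round(late_s2, 4))
```

Each component that is moved off the critical path removes one factor of (1 - p) from the product, which is why the schedule with fewer critical components is less likely to be late.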

If the metaprogram is executed repeatedly, then the expected running
time of a more robust schedule will be smaller than the expected
running time of the less robust schedule, assuming that the running
time of each component has small variations around its expected
value.

Although our model of the execution times is naive, the qualitative
result will apply to any reasonably narrow distribution of the
execution times. We assumed a bounded execution time in order to
prove the robustness of the algorithm. In practice a weaker
assumption along the lines of a very narrow distribution around the
estimated execution time should lead to the same result.

2.1.  Data  and  host  dependencies 

In the following we devise an analytic measure of the vulnerability
of a schedule to hazards. We are interested in how an increase in the
execution time of a component affects the total execution time of the
metaprogram, or in the positive effects of an early termination.

Given a component $N_i$ of the metaprogram, its starting, execution,
and completion times are respectively $t_{s_i}$, $t_i$, and
$t_{f_i}$. If the completion time is exceeded we say that component
$N_i$ is late. There are two reasons for a component to be late:

• The actual execution time of the component is longer than expected,
$\tilde{t}_i > t_i$, and/or

• The execution of the component begins later than expected,
$\tilde{t}_{s_i} > t_{s_i}$, due to interactions among the entities
involved. We recognize (a) data dependencies: some component was late
in creating data needed for the execution of component $N_i$; (b)
communication dependencies: the data transfer between the producer
and the consumer takes longer than expected; and (c) host
dependencies: some component was late in completing its execution on
the same host where component $N_i$ will be executed.

[Figure 2, recoverable label: component H, executed on H1, started at
t = 110 s, terminated at t = 140 s.]

Figure 2. Augmentation of the data dependency graph of a metaprogram
with links corresponding to the host dependencies. The host
dependencies are drawn with dotted lines.

Host dependencies may be resolved by a dynamic scheduler that can
reassign the component to a different host. This is a nontrivial
problem on its own, because the modifications in the mapping may lead
to large performance penalties, and the real delay is not known at
the moment of the scheduling. We may sacrifice our precomputed
optimal schedule for an insignificant delay. In this paper we deal
only with precomputed static schedules. While data and communication
dependencies are invariants of the metaprogram, the host dependency
is a property of a particular schedule.

In the following we introduce a robustness metric for the schedules.
In this case data and host dependencies can be treated identically.
Our approach is to augment the data dependency graph of the
metaprogram with the host dependency links as shown in Figure 2. If
two programs are scheduled one after another on the same host, a new
link is added between them. If a data dependency link between the two
components is already in place, no new link is added. For optimal
schedules most of the host dependencies follow the data dependencies
because optimal schedules try to avoid moving data around.

Figure 2 illustrates the effect of host dependencies. Component C
terminates at t = 85 s, and the data it generates is needed only by
component H, which starts at t = 110 s, so one is tempted to believe
that we have a comfortable spare time for C. However, the host
dependency link from C to G shows that any delay in terminating C
will delay the start and implicitly the termination of component G on
the host H1. Component G is critical, so the host dependency makes
the component C critical, too.

In our analysis we do not differentiate between delays caused by a
late start or longer execution time.

Given a metaprogram and a schedule, a shifted schedule is one where
the mapping and order of execution of the components on every host is
the same, but the startup of a component is adjusted such that it is
launched immediately when its data and host dependencies are
satisfied. We assume that the scheduler can automatically shift the
schedule, if needed.

An upper limit of the effect of the delay is given by the following
theorem.

Theorem 1 Given a metaprogram $V$, a grid $G$, and a static schedule
$S$, let the execution time of each component be $t_i$, and the total
execution time be $T(V, G, S)$. If the actual execution time of each
component changes by $\Delta t_i$, there is a schedule $S'$ whose
execution time is smaller than:

$T(V, G, S) + \sum_i \Delta t'_i$

where

$\Delta t'_i = \begin{cases} \Delta t_i & \text{if } \Delta t_i > 0 \\ 0 & \text{otherwise} \end{cases}$

This theorem provides an upper limit for the delay of the schedules.
However, this upper limit is very disappointing, and raises the
question of whether it can be improved. Consider a critical path
$P_{crit} = \{C_1, C_2, \ldots, C_n\}$ where $C_1$ is the starting
component and $C_n$ is the goal component of the metaprogram.
Obviously a lower limit of the total execution time would be:

$T(V, G, S) + \sum_{i=1}^{n} \Delta t_{C_i}$

This lower limit shows us that if a component on the critical path is
late, this delay will propagate into the total execution time of the
metaprogram. A metaprogram can also be late because of components
outside the critical path. In this paper we develop methods for
reducing the chance that a metaprogram is late due to components that
are not on the critical path constructed assuming deterministic
execution times.

Unfortunately, a late termination is more likely to affect the
schedule than an early termination. This is because most components
have more than one dependency. Any late dependency forces the
component to be late, while all dependencies must be early to permit
an early start of the component.

2.2.  Spare  time 

Consider a metaprogram and a static schedule. Given component A we
want to determine the way in which the delay of component A
influences the execution time of the metaprogram.

We assume that there is a path from each component to the goal
component and there is only one goal component. If there is more than
one, we introduce a final goal component which depends on all of the
goal components, and has zero execution time.

The spare time of a link is defined as:

$t_{spare}(A \to B) = t_{s_B} - t_{f_A}$

The slack of a component, $\sigma_i$, is the minimum spare time on
any path from the component to the goal component, or equivalently
the length of the shortest path to the goal component in the
augmented graph of spare times. We call a component critical if its
slack is zero. Even if the spare time of a component is zero, its
slack can be nonzero. The importance of the slack is demonstrated by
the following theorem, given here without a proof.

Theorem 2 Consider a metaprogram M, a schedule S and a component
$N_i$ with a nonzero slack $\sigma_i$. If component $N_i$ exceeds its
estimated execution time by $\Delta t_i \le \sigma_i$, the shifted
schedule will have the same execution time as the original one,
provided that all other components meet their deadlines.

Corollary 1 If a component $N_i$ has at least one non-critical
component on all paths to the goal state, the component is
non-critical.

This definition gives us a simple algorithm to compute the slack of
all components of a metaprogram.

1. augment the metaprogram graph with
   the host dependency links.

2. label each link with the spare time
   on the link.

3. for each component compute the
   shortest path to the goal component
   in the graph of the spare times.

The slacks can be computed using Dijkstra's shortest path algorithm
or improvements of it in $O(n^2)$ time [2]. The intuition behind the
slack is that spare time on the shortest path from a component to the
goal component may back-propagate and allow a component with no spare
time to be late.
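
The three steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are ours, the spare time of a link A to B is taken as the start of B minus the finish of A, and the data edges of the example metaprogram are an assumption (the graph of Figure 1 is not recoverable from the text). The start and finish times follow schedule S2, with E finishing at t = 120 s.

```python
import heapq
from collections import defaultdict


def augment_with_host_links(data_edges, host, start):
    """Step 1: add a link between components that run back to back on one host."""
    edges = set(data_edges)
    by_host = defaultdict(list)
    for comp, h in host.items():
        by_host[h].append(comp)
    for comps in by_host.values():
        comps.sort(key=lambda c: start[c])
        for a, b in zip(comps, comps[1:]):
            edges.add((a, b))  # the set ignores links that already exist
    return edges


def slacks(edges, start, finish, goal):
    """Steps 2 and 3: label links with spare times, then run Dijkstra from
    the goal over the reversed graph. Spare times must be non-negative."""
    reverse = defaultdict(list)
    for a, b in edges:
        reverse[b].append((a, start[b] - finish[a]))  # spare time of link a -> b
    dist = {goal: 0}
    heap = [(0, goal)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for pred, spare in reverse[node]:
            nd = d + spare
            if nd < dist.get(pred, float("inf")):
                dist[pred] = nd
                heapq.heappush(heap, (nd, pred))
    return dist


# Hypothetical example: times as in schedule S2, data edges assumed.
start = {"A": 0, "B": 10, "C": 10, "D": 10, "E": 90}
finish = {"A": 10, "B": 90, "C": 50, "D": 50, "E": 120}
host = {"A": "H1", "B": "H1", "C": "H2", "D": "H3", "E": "H1"}
data_edges = [("A", "B"), ("A", "C"), ("A", "D"),
              ("B", "E"), ("C", "E"), ("D", "E")]

edges = augment_with_host_links(data_edges, host, start)
example_slacks = slacks(edges, start, finish, "E")
print(example_slacks)
```

Running Dijkstra once from the goal over the reversed graph gives the slack of every component in a single pass, instead of one shortest-path search per component.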

2.3.  Adjusted  slack 

One of the drawbacks of using the slack as a measure of the
robustness of a schedule is the fact that every component is treated
separately. Theorem 2 establishes the slack of a component as a
useful measure only for the case when all other components meet their
deadlines, a rather strong assumption. In a practical case more than
one component can be late. A delay in a component on which our
component depends may cause a decrease in the slack of the current
component. Intuitively, the same slack $\sigma$ is better at the
beginning of a computation than close to the end, where possible
delays from earlier components can "chop off" parts of it.

We are interested in a measure which proves a result similar to
Theorem 2, but allows more than one component to be late. In the
general case it is difficult to construct such a measure because it
depends on the distribution of the execution time of the previous
components. In the following we assume that the actual execution time
of component $N_i$ does not exceed its estimated execution time by
more than a factor $q > 1$, $\tilde{t}_i \le q \times t_i$, and that
$q$ is the same for all components. This assumption bounds the
starting time of a component as well:

$\tilde{t}_{s_i} \le q \times t_{s_i}$

A bounded time metaprogram is one where the execution times of all
components are bounded.

We define the adjusted slack as the slack modified to account for a
late start.

$\bar{\sigma}_i = \max(0, \sigma_i - (q - 1)\, t_{s_i})$

Theorem 3 Consider a bounded time metaprogram $V$ and a schedule $S$.
If $\Delta t_i \le \bar{\sigma}_i \quad \forall N_i \in N$ then there
is a shifted schedule $S'$ with the same execution time as the
original schedule $S$.


This theorem shows that the adjusted slack is a more useful metric
than the slack, because it allows any of the components to be late
within the limits of its adjusted slack.

We call a component safe if its adjusted slack is larger than the
upper bound of delay on the component. A safe component cannot cause
the total execution time to be late (provided that the upper bound on
delays holds). An immediate application of this analysis is the
identification of the safe components.
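
A small Python sketch of the safety test follows. It rests on two assumptions of ours: the adjusted slack is read as max(0, slack - (q - 1) * start), and the "upper bound of delay on the component" is taken to be (q - 1) times its estimated execution time. The function names and the numeric values are illustrative only.

```python
def adjusted_slack(slack, start, q):
    """Slack left after a worst-case late start: max(0, slack - (q - 1) * start)."""
    return max(0.0, slack - (q - 1.0) * start)


def is_safe(slack, start, estimate, q):
    """Assumed reading of 'safe': the adjusted slack exceeds the largest
    possible overrun of the component itself, (q - 1) * estimate."""
    return adjusted_slack(slack, start, q) > (q - 1.0) * estimate


q = 1.5  # hypothetical bound factor
# A component with 40 s of slack that starts at t = 10 s and runs for 40 s.
print(adjusted_slack(40, 10, q), is_safe(40, 10, 40, q))
# A critical component: zero slack, so it can never be safe.
print(adjusted_slack(0, 0, q), is_safe(0, 0, 10, q))
```

Because the correction term grows with the starting time, the same slack is worth less late in the schedule, which matches the intuition given for the adjusted slack.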

2.4.  Potentially  critical  paths  and  components 

Call $T_i$ the cost of a path $P_i$, $h_i$ the cumulative effect of
the hazards on that path, and $T'_i = T_i + h_i$ the cost of the path
in the presence of the hazards, and write $P_i = (T_i, h_i)$.

The effect of the hazards partitions the set $\mathcal{P}$ of all
paths into two disjoint subsets, $\mathcal{P}_c$, paths that have the
potential of becoming the critical path, and $\mathcal{P}_n$, paths
that cannot become critical,
$\mathcal{P} = \mathcal{P}_c \cup \mathcal{P}_n$, such that

$\min(T'_i, \forall P_i \in \mathcal{P}_c) > \max(T_j, \forall P_j \in \mathcal{P}_n)$

We call a component $N_i$ potentially critical if it appears on at
least one potentially critical path. The set of components is
$\mathcal{N} = \mathcal{N}_c \cup \mathcal{N}_n$ with $\mathcal{N}_c$
the set of potentially critical components and $\mathcal{N}_n$ the
components that are not critical.

The analysis supported by Theorems 2 and 3 can be restricted to the
components in $\mathcal{N}_c$. Theorem 1 provides a tighter bound
when applied only to components in $\mathcal{N}_c$. For example,
consider a metaprogram with 3 paths $P_1 = (1000, 10)$,
$P_2 = (10, 600)$ and $P_3 = (60, 800)$. In this case
$\mathcal{P}_c = \{P_1\}$ and $\mathcal{P}_n = \{P_2, P_3\}$. Assume
that hazards affect only one component in each path, and that these
components are different. Applied to the entire set of components
$\mathcal{N}$, the bound of Theorem 1 is $\sum \Delta t'_i = 1410$;
applied to $\mathcal{N}_c$ only, it is $\sum \Delta t'_i = 10$.
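
The arithmetic of this example can be reproduced with a few lines of Python. The sketch assumes, as the example does, one affected component per path; the helper name `delay_bound` and the dictionary layout are ours.

```python
def delay_bound(deltas):
    """Theorem 1 style bound: sum of the positive changes in execution time."""
    return sum(d for d in deltas if d > 0)


# Hazard on the single affected component of each path (values from the example).
hazard = {"P1": 10, "P2": 600, "P3": 800}
potentially_critical = ["P1"]

bound_all = delay_bound(hazard.values())
bound_critical = delay_bound(hazard[p] for p in potentially_critical)
print(bound_all, bound_critical)
```

Restricting the sum to the potentially critical components drops the hazards on paths that can never determine the total execution time.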

The concept of robustness of a schedule can be expressed in terms of
path criticality. If we call $p_i$ the probability that path $P_i$,
$1 \le i \le n$, becomes critical subject to hazards $h$ and
$p_i =$ [illegible] then we can define the entropy of a schedule as

$H(S) = -\sum_{i=1}^{n} p_i \log p_i$

If a schedule with n paths has only one critical path, say $P_1$,
then $p_1 = 1$, $p_i = 0$ for $i = 2..n$ and we have $H(S) = 0$. If
all n paths can become critical subject to hazards $h$ then
$p_i = 1/n$ and $H(S) = \log n$.

The entropy of the path is then a measure of robustness. We say that
schedule $S_j$ is more robust than $S_i$ if $H(S_j) < H(S_i)$.

Example:

Let us label the three paths of schedule $S_1$ in Figure 1 from left
to right as $P_1$, $P_2$ and $P_3$. Assuming that the hazards lead to
an increase of the execution time of no more than 30 s, we have

$p_1 = p_2 = 1/2, \quad p_3 = 0, \quad H(S_1) = 1$

For $S_2$ the paths are $P'_1$, $P'_2$ and $P'_3$, and

$p'_1 = p'_2 = 0, \quad p'_3 = 1, \quad H(S_2) = 0$
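
The entropy values of the example can be checked numerically. The sketch below assumes the usual Shannon form of the entropy, with a base 2 logarithm (an assumption chosen because it gives the value 1 for two equally likely paths) and with the convention that terms of probability zero contribute nothing. The function name is ours.

```python
import math


def schedule_entropy(probs):
    """Shannon entropy (base 2) of the path-criticality probabilities.
    Paths with probability 0 are skipped, i.e. 0 * log(0) is taken as 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


h_s1 = schedule_entropy([0.5, 0.5, 0.0])  # two paths may become critical
h_s2 = schedule_entropy([0.0, 0.0, 1.0])  # a single path is always critical
print(h_s1, h_s2)
```

A lower entropy means that the critical path is more predictable, so in this example the second schedule is the more robust one.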

3  Constructing  robust  schedules 

In this section we show how the robustness analysis can be used to
improve a scheduling algorithm. Given a metaprogram $V$ and a grid
$G$ we can partition the set of all schedules into equivalence
classes based upon the total execution time of the metaprogram.
Schedules with the same execution time form a class of
iso-schedules. They either have the same critical path or critical
paths of equal cost (execution time). In a deterministic case they
are indistinguishable from one another, but they exhibit different
performance under non-deterministic component execution time
assumptions. In this paper we discuss metrics to differentiate
amongst the members of a class based upon the robustness.

Two scenarios are studied: in the first one we assume a real time
system where our goal is to maximize the number of safe components,
while in the second one we consider a common metaprogram where the
objective is to minimize the average execution time over a number of
runs.

3.1  Scenario  1:  Real  time  system 

In this scenario we consider a real time distributed system, where
meeting the deadlines is critical.

A component can be executed using static allocation and spatial
partitioning of the grid. We call this case strict scheduling
conditions and assume that a component scheduled this way will meet
its deadline. An alternative way of executing a component on a grid
is by dynamic allocation based upon temporal partitioning of the grid
(weak scheduling conditions). This execution mode is more efficient
from the point of view of resource utilization, but may cause delays
in the execution of the components because a component may have to
wait for a resource used by another component. We will assume that
even under weak scheduling conditions we have an upper bound of the
execution time

$t_{max} = q \times t$.

Strict scheduling decreases the throughput of the system. We are
interested in minimizing the number of components which need strict
scheduling.

Theorem 3 shows that if the adjusted slack of a component is larger
than the bound on execution time (safe components), the possible
variation in the execution time cannot affect the total execution
time, regardless of the rest of the system. This implies that a safe
component does not require strict scheduling.

Our goal is to maximize the number of safe components, and to keep
the estimated execution time smaller than or equal to the deadline.
An outline of a time-limited heuristic search robust algorithm is
given below:

while (not timeout)
{
    1. find a new schedule according to the heuristic
    2. if the execution time is longer than the deadline, reject it
    3. otherwise
       3a. perform robustness analysis
       3b. compute the number of safe components
       3c. if (number of safe components larger
               than in the current schedule)
           {
               make it the current schedule
           }
}
return current schedule
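
The outline can be turned into a short Python sketch. Everything here beyond the loop structure is an assumption of ours: the heuristic is represented by an iterable of candidate schedules, `exec_time` and `count_safe` are hypothetical callbacks standing in for the scheduler's cost function and for the robustness analysis, and the candidate data is invented for the demonstration.

```python
import time


def robust_search(candidates, deadline, exec_time, count_safe, budget_s=1.0):
    """Keep the deadline-feasible schedule with the most safe components."""
    best, best_safe = None, -1
    stop_at = time.monotonic() + budget_s
    for schedule in candidates:            # step 1: next schedule from the heuristic
        if time.monotonic() >= stop_at:    # timeout reached
            break
        if exec_time(schedule) > deadline:  # step 2: reject late schedules
            continue
        safe = count_safe(schedule)         # steps 3a and 3b
        if safe > best_safe:                # step 3c
            best, best_safe = schedule, safe
    return best


# Hypothetical candidates: estimated execution time and number of safe components.
candidates = [
    {"name": "S1", "time": 120, "safe": 1},
    {"name": "S2", "time": 120, "safe": 2},
    {"name": "S3", "time": 130, "safe": 5},  # misses the deadline
]
best = robust_search(candidates, 120,
                     exec_time=lambda s: s["time"],
                     count_safe=lambda s: s["safe"])
print(best["name"])
```

Passing the heuristic as an iterable keeps the search loop independent of how candidate schedules are generated, so the same loop could be driven by a heuristic-guided search or by the population of a genetic algorithm.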

The same principle can be used to create a genetic algorithm for the
same objective function.

Identification of safe components has important consequences upon
resource utilization in case of strict scheduling conditions. Figure
3 shows the resource allocation graph for the components of the grid.
The graph (a) shows the static allocation of all the resources of the
grid for the time of execution of the metaprogram. The gray contour
on the graph (b) shows the actual time intervals where different
components are executed on the elements of the grid. The lower
boundary of the contour is formed by the starting times of the first
component, while the upper boundary is formed by the finishing times
of the last component executed on the specific host. Inside this we
have a hashed contour which represents the time frame where critical
components are executed. This region represents the time and space
frame where resource allocation is required. In the gray region only
safe components are executed, which does not require resource
allocation, while in the white region no component of the metaprogram
is executed. The hashed region corresponding to the strict resource
allocation requirements still spans the entire time between the start
and end of the metaprogram - corresponding to the critical path of
the schedule. The robustness analysis permits a dynamic allocation of
resources by identifying components outside the critical path which
do not require strict resource allocation and/or selecting the
schedules which maximize the number of such components.

Figure 3. The effect of the robustness analysis upon the resource
allocation. On the schedule on the left, host $H_j$ is allocated
exclusively to the metaprogram at time $t_1$ and deallocated at time
$t_6$. On the schedule on the right the resource is available for
other tasks during the intervals $(t_1, t_2)$ and $(t_5, t_6)$; it
executes non-critical tasks in $(t_2, t_3)$ and $(t_4, t_5)$ and
critical tasks in $(t_3, t_4)$.


3.2  Scenario  2:  Common  metaprograms 


In  this  scenario  we  are  assuming  that  our  system  is  not 
critical  (i.e.  there  is  no  deadline).  We  are  still  assuming  a 
bounded  distribution  of  the  execution  time  around  the  esti¬ 
mated  value.  In  this  case  we  are  interested  in  obtaining  a 
minimal  average  execution  time  for  the  distributed  applica¬ 
tion  for  a  number  of  runs. 

To design a search algorithm we need a unique robustness
measure to compare schedules. A good robustness
measure should have the following properties:

•  Increase with the slack of the components.

•  Penalize slack larger than the time bound.


Unfortunately, devising a measure which accurately predicts
which of the iso-schedules gives the minimal average
execution time depends on the shape of the distribution of
execution times, information that is difficult to obtain in
practice. We propose an empirical formula which has the
advantage of being simple and easy to compute, and which
performs well in experiments.

where $\gamma \in (0, 1]$ is a constant depending on the shape of
the distribution and can be determined experimentally.

An  outline  of  a  timeout-limited  heuristic  search  robust 
algorithm  is  given  below: 

while (not timeout)
    1. find a new schedule according to the heuristic
    2. if the execution time is longer than the current schedule,
       reject it
    3. if it is shorter, make it the current schedule
    4. if they have equal length
       4a. perform robustness analysis on the new schedule
       4b. compute the robustness measure R(S) for the new schedule
       4c. if it is larger than for the current schedule,
           make it the current schedule
return current schedule
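The outline above orders schedules first by execution time and only then by robustness. A minimal Python sketch of that tie-breaking logic is given below. The function name and the callables `execution_time` and `robustness` are assumptions introduced for illustration; in particular `robustness` stands in for the measure R(S), whose formula is not reproduced here. The sketch evaluates the robustness of the current schedule lazily, which is one possible way to perform the analysis only when two schedules have equal length.

```python
import time
from typing import Callable, Iterable, Optional, TypeVar

S = TypeVar("S")  # an opaque schedule representation (assumption)


def robust_search_min_time(
    candidates: Iterable[S],
    execution_time: Callable[[S], float],
    robustness: Callable[[S], float],
    timeout_s: float,
) -> Optional[S]:
    """Prefer shorter schedules; among equally long ones prefer larger R(S)."""
    best: Optional[S] = None
    best_time = float("inf")
    best_r: Optional[float] = None      # computed only when a tie occurs
    stop_at = time.monotonic() + timeout_s
    for schedule in candidates:         # step 1
        if time.monotonic() >= stop_at:
            break
        t = execution_time(schedule)
        if t > best_time:               # step 2: longer, reject
            continue
        if t < best_time:               # step 3: shorter, accept without analysis
            best, best_time, best_r = schedule, t, None
            continue
        # step 4: equal length (iso-schedules), compare robustness
        if best_r is None:
            best_r = robustness(best)
        r = robustness(schedule)
        if r > best_r:
            best, best_r = schedule, r
    return best
```

Note that the equality test on execution times is exact; with floating-point schedule lengths a tolerance may be more appropriate.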

4  Experimental  results 

In  this  section  we  study  two  questions: 

•  What  is  the  cost  of  taking  into  consideration  the  ro¬ 
bustness  in  scheduling? 

•  What  is  the  average  improvement  achievable  by  robust 
scheduling? 

The answers to these questions depend on the nature of the
applications and on the distribution of the execution times, so a
general answer is difficult to give. An extensive testing
process was performed using the Bond environment [6] to build
up some confidence in the results.

All examples presented in this article were handcrafted
in order to demonstrate the validity of the ideas. Nevertheless,
they do not answer an important question: how often
does the possibility of improving the schedule using
the robustness metric arise in practice, and how dramatic are
these improvements? These questions cannot be answered without
knowledge of the application domain; some types of
metaprograms may permit more optimizations than others.

Our experiments were made using randomly generated
schedules. We were motivated by the desire to provide a fair
evaluation of the algorithms presented in this paper. Carefully
selected schedules may favor an algorithm that performs
poorly on most schedules.


The random schedules were generated using the following
algorithm:

1. The number of components NC is generated as a random
   number in the range 15..30. The type of each component
   is chosen randomly from a collection of 5 different
   components.

2. The number of links is generated in the range
   1..NC*(NC-1)/2.

3. The links are generated by generating random pairs of
   numbers. The connection always goes from the lower
   ranking to the higher ranking component, in order to
   avoid cycles.
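The three generation steps above can be sketched in Python as follows. This is a hedged illustration rather than the code used in the experiments: the function name is hypothetical, component types are represented simply as integers 0..4, and duplicate pairs are discarded so that exactly the drawn number of distinct links is produced (the text does not say how duplicates were handled).

```python
import random


def random_metaprogram(rng: random.Random, n_types: int = 5,
                       nc_range: tuple = (15, 30)):
    """Return (types, links) for a random acyclic metaprogram."""
    # Step 1: number of components and a random type for each one.
    nc = rng.randint(*nc_range)                      # inclusive range 15..30
    types = [rng.randrange(n_types) for _ in range(nc)]

    # Step 2: number of links in the range 1..NC*(NC-1)/2.
    n_links = rng.randint(1, nc * (nc - 1) // 2)

    # Step 3: random pairs, always directed from the lower-ranked to the
    # higher-ranked component, which rules out cycles.
    links = set()
    while len(links) < n_links:
        a, b = rng.sample(range(nc), 2)              # two distinct components
        links.add((min(a, b), max(a, b)))
    return types, sorted(links)
```

Because every link points from a lower index to a higher one, the index order is itself a valid topological order of the generated graph.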

In the testing process 100 random metaprograms were
generated. For each metaprogram a timeout-limited optimal
static scheduling algorithm was applied. The algorithm
was modified to maintain the optimal schedule according to
the robustness measures presented for the two scenarios
in Section 3. The robustness analysis was performed
only if needed (i.e., if the decision could not be made using
the total execution time). Every schedule was run 200 times,
with the execution times drawn from a normal distribution
with the mean being the estimated execution time and the
upper bound being q = 1.4.

We  have  collected  the  following  data: 

•  The  number  of  schedules  checked,  and  the  number  of 
iso-schedules  found. 

•  The  number  of  schedules  for  which  the  robustness 
analysis  was  performed. 

•  The  best  and  worst  number  of  safe  components. 

•  The  minimal  average  execution  time,  and  the  execu¬ 
tion  time  of  the  first  optimal  schedule  found  (which 
would  have  been  retained  without  the  robustness  anal¬ 
ysis) 

In  our  algorithm  the  robustness  analysis  was  performed 
only  when  the  total  execution  times  were  identical. 

The  robustness  analysis  was  performed  129,683  times 
for  the  8,768,794  schedules  generated,  representing  less 
than 1.5% of the cases. We can conclude that the robustness
analysis is easy to implement and cheap with respect to
the computational demand.

Table 1 shows the relative improvement in the number
of safe components and mean execution time.

                                    without robustness   with robustness
                                    analysis             analysis
No. of safe components              5.23                 8.74
Mean execution time for 200 runs    68.93s               67.91s

Table 1. The effects of the robustness analysis
on the number of safe components and
the mean execution time of a metaprogram.
The numbers represent the average of 100
randomly generated metaprograms.


The results show an average increase of 67% in the number
of safe components. The improvement in the execution
time is not as spectacular, being in the range of 1.5%.
Nevertheless, a 1.5% improvement in the execution time
represents approximately a 10% improvement in the expected
delay. For the best case tested the number of safe components
improved from 3 to 16, and the mean execution time
decreased by 6%. In the worst case there was no improvement
in the indicators. The variation of the results is caused by
the internal structure of the metaprograms. The robustness
analysis cannot create spare time; it can only redistribute it
to places where it is needed, and even this redistribution has
specific limitations.
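As a quick check, the percentages quoted in this section can be recomputed directly from the reported counts and table entries. The short Python snippet below does only that arithmetic; the variable names are illustrative.

```python
# Figures as reported in the text and in the results table.
analyses, schedules = 129_683, 8_768_794
safe_without, safe_with = 5.23, 8.74
time_without, time_with = 68.93, 67.91   # seconds

# Share of generated schedules that needed a robustness analysis.
analysis_share = analyses / schedules

# Relative increase in safe components and relative decrease in mean time.
safe_gain = (safe_with - safe_without) / safe_without
time_gain = (time_without - time_with) / time_without

print(f"robustness analyses: {analysis_share:.2%} of schedules")
print(f"safe components:     +{safe_gain:.0%}")
print(f"mean execution time: -{time_gain:.1%}")
```

The three ratios come out at roughly 1.48%, 67%, and 1.5%, in agreement with the statements above.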

As a side effect of the robustness analysis the safe
components of a schedule are identified. Even if we cannot
increase their number, due to the specifics of the metaprogram,
by identifying the safe components we can employ
dynamic resource allocation, reducing the cost of running
the metaprogram.

In conclusion, robustness analysis is a practical
method to improve the quality of metaprogram scheduling
algorithms. We argue that the robustness analysis can be
integrated into scheduling algorithms with ease and at a low
cost, and that the gain in terms of better quality schedules
outweighs the computation and implementation costs.

5.  Conclusions 

Scheduling of dependent tasks with deterministic execution
times is known to be NP-complete. The problem of
scheduling tasks whose execution time is non-deterministic
as a result of various hazards is conceptually more challenging.

Given a schedule for an augmented dependency graph
one can construct classes of iso-schedules, schedules with
the same cost of the critical path (the cost of the critical
path is the elapsed time from the initial component to the
goal component). Some of the schedules in one class are
less sensitive to the effects of the hazards because components
on non-critical paths have some spare time, or slack in
our terminology. The startup time of such a component can
be delayed, or its execution may take longer, in such
a "shifted schedule" without increasing the total execution
time of the entire graph. We call this property of a schedule
robustness.

In  this  paper  we  assume  bounded  variations  of  the  exe¬ 
cution  time  of  components  and  devise  an  analytic  measure 
of  the  vulnerability  of  a  schedule  to  hazards.  We  provide 
an  upper  bound  for  the  execution  time  of  a  schedule  subject 
to  hazards  and  prove  two  theorems  regarding  shifted  sched¬ 
ules. 

We  introduce  the  concept  of  a  critical  component,  one 
whose  increase  of  the  execution  time  due  to  hazards  may 
cause  the  execution  path  to  become  critical.  Then  we  dis¬ 
cuss  measures  of  robustness.  A  possible  measure  is  the 
number  of  critical  components  within  a  schedule,  the  fewer, 
the  more  robust  the  schedule.  An  alternative  measure  of  ro¬ 
bustness  introduced  in  this  paper  is  the  entropy  of  a  sched¬ 
ule.  The  entropy  of  a  schedule  is  based  upon  the  probability 
of  an  execution  path  of  becoming  critical.  In  the  general 
case,  determining  this  probability  is  a  non-trivial  task  and 
more  research  is  needed  before  this  measure  may  prove  its 
usefulness. 

Elsewhere  [4]  we  provide  the  proofs  to  the  theorems  in 
this  paper. 
Abstract

A major challenge in Metacomputing Systems (Computational
Grids) is to effectively use their shared resources,
such as compute cycles, memory, communication network,
and data repositories, to optimize desired global objectives.
In this paper we develop a unified framework for resource
scheduling in metacomputing systems where tasks with various
requirements are submitted from participant sites. Our
goal  is  to  minimize  the  overall  execution  time  of  a  collection 
of  application  tasks.  In  our  model,  each  application  task  is 
represented  by  a  Directed  Acyclic  Graph  (DAG).  A  task  con¬ 
sists  of  several  subtasks  and  the  resource  requirements  are 
specified  at  subtask  level.  Our  framework  is  general  and  it 
accommodates  emerging  notions  of  Quality  of  Service  (QoS) 
and  advance  resource  reservations.  In  this  paper,  we  present 
several  scheduling  algorithms  which  consider  compute  re¬ 
sources  and  data  repositories  that  have  advance  reserva¬ 
tions.  As  shown  by  our  simulation  results,  it  is  advantageous 
to  schedule  the  system  resources  in  a  unified  manner  rather 
than  scheduling  each  type  of  resource  separately.  Our  al¬ 
gorithms  have  at  least  50%  improvement  over  the  separated 
approach  with  respect  to  completion  time. 


1.  Introduction 

With the improvements in communication capability
among geographically distributed systems, it is attractive to
use a diverse set of resources to solve challenging applications.
Such Heterogeneous Computing (HC) systems [12, 17]
are called metacomputing systems [26] or computational
grids [8]. Several research projects are underway,
including, for example, MSHN [22], Globus [13], and Legion [19],
in which the users can select and employ resources
at different domains in a seamless manner to execute
their applications. In general, such metacomputing systems
will have compute resources with different capabilities,
display devices, and data repositories, all interconnected by
heterogeneous local and wide area networks. A variety of tools
and services are being developed for users to submit and
execute their applications on a metacomputing system.

A  major  challenge  in  using  metacomputing  systems  is 
to  effectively  use  the  available  resources.  In  a  metacom¬ 
puting  environment,  applications  are  submitted  from  vari¬ 
ous  user  sites  and  share  system  resources.  These  resources 
include  compute  resources,  communication  resources  (net¬ 
work  bandwidth),  and  data  repositories  (file  servers).  Pro¬ 
grams  executing  in  such  an  environment  typically  consist 
of  one  or  more  subtasks  that  communicate  and  cooperate  to 
form  a  single  application.  Users  submit  jobs  from  their  sites 
to  a  metacomputing  system  by  sending  their  tasks  along  with 
Quality  of  Service  (QoS)  requirements. 

Task  scheduling  in  a  distributed  system  is  a  classic  prob¬ 
lem  (for  a  detailed  classification  see  [5,  6]).  Recently,  there 
have  been  several  works  on  scheduling  tasks  in  metacom¬ 
puting  systems.  Scheduling  independent  jobs  (meta-tasks) 
has  been  considered  in  [2,  11,  14].  For  application  tasks 
represented  by  Directed  Acyclic  Graphs  (DAGs),  many  dy¬ 
namic  scheduling  algorithms  have  been  devised.  These  in¬ 
clude  the  Hybrid  Remapper  [20],  the  Generational  algo¬ 
rithm  [9],  as  well  as  others  [15,  18].  Several  static  algo¬ 
rithms  for  scheduling  DAGs  in  metacomputing  systems  are 
described  in  [16,  23, 24,  25,  27].  Most  of  the  previous  algo¬ 
rithms  focus  on  compute  cycles  as  the  main  resource.  Also, 
previous  DAGs  scheduling  algorithms  assume  that  a  sub¬ 
task  receives  all  its  input  data  from  its  predecessor  subtasks. 
Therefore,  their  scheduling  decisions  are  based  on  machine 
performance  for  the  subtasks  and  the  cost  of  receiving  input 
data from predecessor subtasks only.

Many metacomputing applications need other resources,
such as data repositories, in addition to compute resources.
For example, in data-intensive computing [21] applications
access high-volume data from distributed data repositories
such as databases and archival storage systems. Most of
the execution time of these applications is spent in data
movement. These applications can be computationally demanding
and communication intensive as well [21]. To achieve
high performance for such applications, the scheduling
decisions must be based on all the required resources. Assigning
a task to the machine that gives its best execution time
may result in poor performance due to the cost of retrieving
the required input data from data repositories. In [4],
the impact of accessing data servers on scheduling decisions
has been considered in the context of developing an AppLeS
agent for the Digital Sky Survey Analysis (DSSA) application.
The DSSA AppLeS selects where to run a statistical
analysis according to the amount of required data from data
servers. However, the primary motivation was to optimize
the performance of a particular application.

In  this  paper  we  develop  a  unified  framework  for  resource 
scheduling  in  metacomputing  systems.  Our  framework  con¬ 
siders  compute  resources  as  well  as  other  resources  such  as 
the  communication  network  and  data  repositories.  Also,  it 
incorporates  the  emerging  concept  of  advance  reservations 
where  system  resources  can  be  reserved  in  advance  for  spe¬ 
cific  time  intervals.  In  our  framework,  application  tasks  with 
various  requirements  are  submitted  from  participant  sites. 
An  application  task  consists  of  subtasks  and  is  represented 
by  a  DAG.  The  resource  requirements  are  specified  at  the 
subtask  level.  A  subtask’s  input  data  can  be  data  items  from 
its  predecessors  and/or  data  sets  from  data  repositories.  A 
subtask  is  ready  for  execution  if  all  its  predecessors  have 
completed,  and  it  has  received  all  the  input  data  needed  for 
its  execution.  In  our  framework,  we  allow  for  input  data  sets 
to  be  replicated,  i.e.,  the  data  set  can  be  accessed  from  one  or 
more  data  repositories.  Additionally,  a  task  can  be  submit¬ 
ted  with  QoS  requirements,  such  as  needed  compute  cycles, 
memory,  communication  bandwidth,  maximum  completion 
time,  priority,  etc.  In  our  framework,  sources  of  input  data 
and  the  execution  times  of  the  subtasks  on  various  machines 
along  with  their  availability  are  considered  simultaneously 
to  minimize  the  overall  completion  time. 

Although  our  unified  framework  allows  many  factors  to 
be  taken  into  account  in  resource  scheduling,  in  this  pa¬ 
per,  to  illustrate  our  ideas,  we  present  several  heuristic  al¬ 
gorithms  for  a  resource  scheduling  problem  where  the  com¬ 
pute  resources  and  the  data  repositories  have  advance  reser¬ 
vations.  These  resources  are  available  to  schedule  subtasks 
only  during  certain  time  intervals  as  they  are  reserved  (by 
other  users)  at  other  times.  QoS  requirements  such  as  dead¬ 
lines  and  priorities  will  be  included  in  future  algorithms. 


The objective of our resource scheduling algorithms is to
minimize  the  overall  completion  time  of  all  the  submitted 
tasks. 

Our  research  is  a  part  of  the  MSHN  project  [22],  which 
is  a  collaborative  effort  between  DoD  (Naval  Postgraduate 
School),  academia  (NPS,  USC,  Purdue  University),  and  in¬ 
dustry  (NOEMIX).  MSHN  (Management  System  for  Het¬ 
erogeneous  Networks)  is  designing  and  implementing  a  Re¬ 
source  Management  System  (RMS)  for  distributed  hetero¬ 
geneous  and  shared  environments.  MSHN  assumes  hetero¬ 
geneity  in  resources,  processes,  and  QoS  requirements.  Pro¬ 
cesses  may  have  different  priorities,  deadlines,  and  com¬ 
pute  characteristics.  The  goal  is  to  schedule  shared  compute 
and  network  resources  among  individual  applications  so  that 
their  QoS  requirements  are  satisfied.  Our  scheduling  algo¬ 
rithms,  or  their  derivatives,  may  be  included  in  the  Schedul¬ 
ing  Advisor  component  of  MSHN. 

This  paper  is  organized  as  follows.  In  the  next  section 
we  introduce  our  unified  resource  scheduling  framework.  In 
Section  3,  we  present  several  heuristic  algorithms  for  solv¬ 
ing  a  general  resource  scheduling  problem  which  considers 
input  requirements  from  data  repositories  and  advance  reser¬ 
vations  for  system  resources.  Simulation  results  are  pre¬ 
sented  in  Section  4  to  demonstrate  the  performance  of  our 
algorithms.  Finally,  Section  5  gives  some  future  research  di¬ 
rections. 

2.  The  Scheduling  Framework 
2.1. Application Model

In the metacomputing system we are considering, $n$ application
tasks, $\{T_1, \ldots, T_n\}$, compete for computational as
well as other resources (such as communication network and
data repositories). Each application task consists of a set of
communicating subtasks. The data dependencies among the
subtasks are assumed to be known and are represented by a
Directed Acyclic Graph (DAG), $G = (V, E)$. The set of
subtasks of the application to be executed is represented by
$V = \{v_1, v_2, \ldots, v_k\}$ where $k \geq 1$, and $E$ represents the data
dependencies and communication between subtasks. $e_{ij}$ indicates
communication from subtask $v_i$ to $v_j$, and $|e_{ij}|$ represents
the amount of data to be sent from $v_i$ to $v_j$. Figure 1
shows an example with two application tasks. In this example,
task 1 consists of three subtasks, and task 2 consists of
nine subtasks.

In  our  framework,  QoS  requirements  are  specified  for 
each  task.  These  requirements  include  needed  compute  cy¬ 
cles,  memory,  communication  bandwidth,  maximum  com¬ 
pletion  time,  etc.  In  our  model,  a  subtask’s  input  data  can  be 
data  items  from  its  predecessors  and/or  data  sets  from  data 
repositories.  All  of  a  subtask’s  input  data  (the  data  items  and 
the data sets) must be retrieved before its execution. After
a subtask's completion, the generated output data may be
forwarded to successor subtasks and/or written back to data
repositories.

Figure 1. Example of application tasks

In some applications, a subtask may contain sub-subtasks.
For example, Adaptive Signal Processing (ASP)
applications are typically composed of a sequence of
computation stages (subtasks). Each stage consists of a number
of identical sub-subtasks (e.g., FFTs or QR decompositions).
Each stage repeatedly receives its input from the
previous stage, performs computations, and sends its output
to the next stage.


2.2. System Model

The metacomputing system consists of $m$ heterogeneous
machines, $M = \{m_1, m_2, \ldots, m_m\}$, and $f$ file servers or
data repositories, $S = \{s_1, s_2, \ldots, s_f\}$. We assume that an
estimate of the execution time of subtask $v_i$ on machine
$m_j$ is available at compile-time. These estimated execution
times are given in matrix $ECT$. Thus, $ECT(i, j)$ gives
the estimated computation time for subtask $v_i$ on machine
$m_j$. If subtask $v_i$ cannot be executed on machine $m_j$, then
$ECT(i, j)$ is set to infinity.

System resources may not be available over some time
intervals due to advance reservations. Available time intervals
for machine $m_j$ are given by $MA[j]$. Available time
intervals for data repository $s_j$ are given by $SA[j]$. Matrices
$TR$ and $L$ give the message transfer time per byte
and the communication latency between machines, respectively.
Matrices $Data\_TR$ and $Data\_L$ specify the message
transfer time per byte and the communication latency between
the data repositories and the machines, respectively.
$DataSet[i]$ gives the amount of input data sets needed from
data repositories for subtask $v_i$. In systems with multiple
copies of data sets, one or more data repositories can provide
the required data sets for that subtask.

2.3. Problem Statement

Our goal is to minimize the overall execution time
for a collection of applications that compete for system
resources. This strategy (i.e., optimizing the performance
of a collection of tasks as opposed to that of a single
application) has been taken by SmartNet [11] and MSHN [22].
On the other hand, the emphasis in other projects such as
AppLeS [3] is to optimize the performance of an individual
application rather than to cooperate with other applications
sharing the resources. Since multiple users share the
resources, optimizing the performance of an individual
application may dramatically affect the completion time of
other applications.

We now formally state our resource scheduling problem.

Given:

•  A metacomputing system with $m$ machines and $f$ data
repositories,

•  Advance reserved times for system resources as given
by $MA$ and $SA$,

•  $n$ application tasks, $\{T_1, \ldots, T_n\}$, where each
application is represented by a DAG,

•  Communication latencies and transfer rates among the
various resources in matrices $TR$, $L$, $Data\_TR$, and
$Data\_L$,

•  Subtask execution times on various machines in
matrix $ECT$, and

•  Amount of input data sets needed from data repositories
for each subtask $v_i$ as given by $DataSet[i]$.

Find a schedule to


Minimize $\{ \max_{j} [\text{Finish Time}(T_j)] \}$,


where the schedule determines, for each subtask, the start
time and the duration of all the resources needed to execute
that subtask.


Subject to the following constraints:

Figure  2.  Application  DAG  for  the  example  in 
Sec.  2.4 
Table 1. Subtask execution times

Table 2. Transfer costs (time units/data unit)


Subtask   Amount of the Input Data Set   Data Repository Choices
v1        3 units                        S1 or S2
v2        10 units                       S2 or S3
v3        2 units                        S1 or S3
v4        1 unit                         S1 or S2
v5        5 units                        S3

Table  3.  Input  requirements  for  the  subtasks 


•  A  subtask  can  execute  only  after  all  its  predecessors 
have  completed,  all  input  data  items  have  been  received 
from  its  predecessors,  and  the  input  data  sets  have  been 
retrieved  from  one  of  the  data  repositories, 

•  Preserve  all  advance  resource  reservations, 

•  Only  one  subtask  can  execute  on  any  machine  at  any 
given  time,  and 

•  At  most  one  subtask  can  access  any  data  repository  at 
any  given  time. 
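To make the role of the availability intervals and of an infinite entry in the execution-time matrix concrete, the following Python sketch computes the earliest time at which a subtask could finish on a given machine. It is an illustration under stated assumptions, not part of the paper: the helper names and the sample data are hypothetical, execution is assumed to be non-preemptive (a subtask must fit entirely inside one available interval), and intervals already consumed by previously scheduled subtasks are assumed to have been removed from the availability lists by the caller.

```python
import math


def earliest_slot(available, ready, duration):
    """Earliest start >= ready such that [start, start + duration] fits
    inside one available interval; math.inf if no interval is long enough.
    `available` is a list of (start, end) pairs sorted by start time."""
    for start, end in available:
        begin = max(start, ready)
        if begin + duration <= end:
            return begin
    return math.inf


# Hypothetical data: machine availability and estimated execution times.
MA = {"m1": [(0, 5), (8, 20)], "m2": [(2, 30)]}
ECT = {"v1": {"m1": 4, "m2": math.inf}}   # v1 cannot run on m2


def earliest_finish(subtask, machine, ready):
    """Earliest completion time of `subtask` on `machine` once its inputs
    are available at time `ready`."""
    duration = ECT[subtask][machine]
    if math.isinf(duration):
        return math.inf
    return earliest_slot(MA[machine], ready, duration) + duration


# The objective is the largest finish time over all submitted tasks.
finish_time = {"T1": earliest_finish("v1", "m1", ready=3)}
overall_completion = max(finish_time.values())
```

In this example the subtask is ready at time 3 but needs 4 time units, so it does not fit in the first interval of `m1` and is pushed to the interval starting at time 8.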

2.4.  Separated  Scheduling  Vs.  Unified  Scheduling 

Many scheduling methods exist in the literature for
scheduling application DAGs on compute and network
resources. They do not consider data repositories. With the
inclusion of data repositories, one can obtain schedules for
compute resources and data repositories independently and
combine the schedules. In this section we show, with a simple
example, that this separated approach is not efficient
with respect to completion time.


Figure 3. Separated scheduling (machines first)

Figure 4. Separated scheduling (data repositories first)

Figure 5. Unified scheduling

Figure  2  shows  the  DAG  representation  for  an  applica¬ 
tion  task  with  5  subtasks.  In  this  example,  we  assume  a  fully 
connected  system  with  3  machines  and  3  data  repositories 
(file  servers).  The  subtask  execution  times  (in  time  units) 
are given in Table 1. Table 2 gives the cost (in time units)
for  transferring  one  data  unit  from  any  data  repository  to  any 
machine.  We  assume  that  each  subtask  needs  an  input  data 
set,  which  can  be  retrieved  from  one  or  more  data  reposito¬ 
ries  as  given  in  Table  3. 

In this example, we are using a simple list scheduling
algorithm called the Baseline Algorithm. This algorithm has
been described in [20, 27]. The baseline algorithm is a fast
static algorithm for mapping DAGs in HC environments. It
partitions the subtasks in the DAG into blocks (levels) using
an algorithm similar to the level partitioning algorithm
which will be described in Section 3.1. Then all the subtasks
are ordered such that the subtasks in block $k$ come before
the subtasks in block $b$, where $k < b$. The subtasks in
the same block are sorted in descending order based on the
number of descendants of each subtask (ties are broken
arbitrarily). The subtasks are considered for mapping in this
order. A subtask is mapped to the machine that gives the
minimum completion time for that particular subtask. Since
the original algorithm does not account for the data
repositories, we implemented a modified version of the algorithm.
In the modified version, the algorithm chooses a data
repository that gives the best retrieval time for the input data set.
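The ordering step of the baseline algorithm can be sketched in Python as shown below. This is an illustrative reading of the description above, not the implementation from [20, 27]: the function name is hypothetical, and the level of a subtask is assumed to be the length of the longest path from an entry subtask, which is one common form of level partitioning (the paper's own variant is given in its Section 3.1).

```python
from functools import lru_cache


def baseline_order(succ):
    """Order subtasks by block (level), then by descending number of
    descendants. `succ` maps every subtask to a list of its successors."""
    nodes = list(succ)
    preds = {v: [] for v in nodes}
    for u in nodes:
        for v in succ[u]:
            preds[v].append(u)

    @lru_cache(maxsize=None)
    def level(v):
        # Assumed level definition: entry subtasks are in block 0, every
        # other subtask is one block below its deepest predecessor.
        return 0 if not preds[v] else 1 + max(level(p) for p in preds[v])

    @lru_cache(maxsize=None)
    def descendants(v):
        out = set()
        for w in succ[v]:
            out.add(w)
            out |= descendants(w)
        return frozenset(out)

    # sorted() is stable, so ties keep the input order (an arbitrary choice).
    return sorted(nodes, key=lambda v: (level(v), -len(descendants(v))))


# Hypothetical five-subtask DAG.
dag = {"a": ["c"], "b": ["c", "d"], "c": ["e"], "d": [], "e": []}
order = baseline_order(dag)
```

For the small DAG above, `b` precedes `a` within block 0 because it has three descendants rather than two.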

The  schedule  based  on  the  separated  approach,  when 
scheduling  the  machines  first,  is  shown  in  Figure  3.  The 
completion  time  of  this  schedule  is  52  time  units.  For  this 
case, we map the application subtasks to the machines as if
they were the only resources in the system. Then for each
subtask  we  choose  the  data  repository  that  gives  the  best  re¬ 
trieving  (delivery)  time  of  the  input  data  set  to  the  previ¬ 
ously  mapped  machine  for  this  subtask  in  order  to  minimize 
its  completion  time.  The  completion  time  of  the  schedule 
based  on  the  separated  approach,  when  scheduling  the  data 
repositories  first,  is  39  time  units  as  shown  in  Figure  4.  For 
this case, we map the application subtasks to the data repositories
as if they were the only system resources. Then for each
subtask  we  choose  the  machine  that  gives  the  best  comple¬ 
tion  time  for  that  subtask  when  using  the  previously  mapped 
data  repository  to  get  the  required  data  set  for  this  subtask. 
Figure  5  shows  the  schedule  based  on  the  unified  approach. 
The  completion  time  of  the  unified  scheduling  is  28.5  time 
units.  In  the  unified  approach,  we  map  each  subtask  to  a  ma¬ 
chine  and  data  repository  at  the  same  time  in  order  to  mini¬ 
mize  its  completion  time. 
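The difference between the machines-first and the unified choice can be illustrated for a single subtask with a few lines of Python. All numbers below are hypothetical (they are not the entries of Tables 1 and 2), and the completion time is modeled simply as retrieval time plus execution time, with no overlap and no advance reservations.

```python
# Hypothetical costs for one subtask that needs 2 data units.
exec_time = {"m1": 4.0, "m2": 6.0}                  # time units per machine
transfer = {("s1", "m1"): 3.0, ("s2", "m1"): 5.0,   # time units per data unit
            ("s1", "m2"): 0.5, ("s2", "m2"): 1.0}
data_units = 2.0
repositories = ["s1", "s2"]


def completion(machine, repo):
    """Retrieve the input data set, then execute (assumed simple model)."""
    return data_units * transfer[(repo, machine)] + exec_time[machine]


# Separated, machines first: pick the fastest machine, then the best
# repository for that already fixed machine.
m_first = min(exec_time, key=exec_time.get)
s_for_m = min(repositories, key=lambda s: transfer[(s, m_first)])
machines_first = completion(m_first, s_for_m)

# Unified: pick the machine and the repository together.
unified = min(completion(m, s) for m in exec_time for s in repositories)
```

With these values the machines-first choice commits to `m1` and finishes at 10 time units, while the unified choice uses the slower machine `m2` with a cheaper transfer and finishes at 7.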

The  previous  example  shows  clearly  that  the  scheduling 
based  on  the  separated  approach  is  not  efficient  with  respect 
to  completion  time.  Further,  with  advance  reservations,  sep¬ 
arated  scheduling  can  lead  to  poor  utilization  of  resources 
when  one  type  of  resource  is  not  available  while  others  are 
available. 

3.  Resource  Scheduling  Algorithms 

In  this  section,  we  develop  static  (compile-time)  heuris¬ 
tic  algorithms  for  scheduling  tasks  in  a  metacomputing  sys¬ 
tem  where  the  compute  resources  and  the  data  repositories 
have  advance  reservations.  These  resources  are  available  to 
schedule  subtasks  only  during  certain  time  intervals  as  they 
are  reserved  (by  other  users)  at  other  times.  Although  our 
framework  incorporates  the  notion  of  QoS,  the  algorithms 
we present in this paper do not consider QoS. We are currently
working on extending our scheduling algorithms to
consider QoS requirements such as deadlines, priorities, and
security.


Figure 6. Combined DAG for the tasks in Fig. 1

Figure 7. Level partitioning for the combined DAG in Fig. 6

As  in  state-of-the-art  systems,  we  assume  a  central  sched¬ 
uler  with  a  given  set  of  static  application  tasks  to  schedule. 
With static applications, the complete set of tasks to be scheduled
is known a priori. Tasks from all sites are sent to the
central scheduler to determine the schedule for each subtask
so that the global objective is achieved. Information
about the submitted tasks, as well as the status of the various
resources, is communicated to the central scheduler. This
centralized  scheduler  will  then  make  appropriate  decisions 
and  can  achieve  better  utilization  of  the  resources. 

Scheduling in metacomputing systems, even if we schedule
based on compute resources only, is known to be NP-complete,
so heuristics are used in practice. One common heuristic is the
well-known list scheduling algorithm [1, 16, 23]. In list
scheduling, all the subtasks of a DAG are placed in a list
according to some priority assigned to each subtask. A subtask
cannot be scheduled until all its predecessors have been
scheduled. Ready subtasks are considered for scheduling in
order of their priorities. In this section, we develop modified
versions of the list scheduling algorithm for our generalized
task scheduling problem with advance resource reservations.
Our heuristic algorithms based on list scheduling are of two
types: level-by-level scheduling and a greedy approach. In
the following, we briefly describe these two types of algorithms.


3.1.  Level-By-Level  Scheduling 

In  our  framework,  application  tasks  are  represented  by 
DAGs  where  a  node  is  a  subtask  and  the  edges  from  pre¬ 
decessors  represent  control  flow.  Each  subtask  has  compu¬ 
tation  cost,  data  items  to  be  communicated  from  predeces¬ 
sor  subtasks,  and  data  sets  from  one  or  more  repositories. 
A  subtask  is  ready  for  execution  if  all  its  predecessors  have 
completed,  and  it  has  received  all  the  input  data  needed  for 
its  execution.  To  facilitate  the  discussion  of  our  schedul¬ 
ing  algorithms,  a  hypothetical  node  is  created  and  linked, 
with  zero  communication  time  links,  to  the  root  nodes  of 
all  the  submitted  DAGs  to  obtain  one  combined  DAG.  This 
dummy  node  has  zero  computation  time.  Figure  6  shows  the 
combined  DAG  for  the  two  tasks  in  Figure  1.  Now,  mini¬ 
mizing  the  maximum  time  to  complete  this  combined  DAG 
achieves  our  global  objective. 

In the level-by-level heuristic, we first partition the combined
DAG into l levels of subtasks. Each level contains
independent subtasks, i.e., there are no dependencies between
the subtasks in the same level. Therefore, all the subtasks in
a level can be executed in parallel once they are ready. Level
0 contains the dummy node. Level 1 contains all subtasks
that originally do not have any incident edges, i.e., subtasks
without any predecessors in the original DAGs. All subtasks
in level l have no successors. For each subtask v_j in level k,
all of its predecessors are in levels 0 to k - 1, and at least one
of them is in level k - 1. Figure 7 shows the levels of the
combined DAG in Fig. 6. The combined DAG in this example
has 4 levels.

Level-by-Level Scheduling Algorithm
begin
    Combine all submitted DAGs into one DAG.
    Do level partitioning for the combined DAG.
    For level := 1 to l do
        Set Ready to be the set of all subtasks at this level.
        While Ready is not empty do
            Find FINISH(v_i, m_min, s_min) for all subtasks in Ready, where m_min is
                the machine that gives the minimum completion time for subtask v_i
                if data repository s_min has been used to get the input data set.
            Min-FINISH: Choose the subtask v_k with the minimum completion time.
            Max-FINISH: Choose the subtask v_k with the maximum completion time.
            Schedule subtask v_k to machine m_min and data repository s_min.
            Update MA(m_min) and SA(s_min).
            Remove v_k from Ready.
        end While
    end For
end

Figure 8. Pseudo code for the level-by-level scheduling algorithms
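
The level partitioning described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the dictionary `preds`, the dummy node name `v0`, and the two small example tasks are hypothetical and are not the DAGs of Figure 1. Each subtask is placed one level below its deepest predecessor, which gives the property that all predecessors lie in earlier levels and at least one lies in the level immediately before.

```python
from collections import defaultdict

def level_partition(preds):
    """Assign each subtask of the combined DAG to a level.

    preds maps a subtask name to the list of its predecessors in the
    original DAGs. A hypothetical dummy root "v0" occupies level 0 and
    is treated as the predecessor of every subtask without predecessors.
    """
    level = {"v0": 0}

    def level_of(v):
        if v not in level:
            parents = preds[v] or ["v0"]      # roots hang off the dummy node
            # One more than the deepest predecessor: every predecessor is in
            # an earlier level and at least one is in the previous level.
            level[v] = 1 + max(level_of(p) for p in parents)
        return level[v]

    levels = defaultdict(list)
    for v in preds:
        levels[level_of(v)].append(v)
    levels[0].append("v0")
    return dict(levels)

# Two small illustrative tasks, a1..a4 and b1..b2 (assumed example data).
preds = {
    "a1": [], "a2": ["a1"], "a3": ["a1"], "a4": ["a2", "a3"],
    "b1": [], "b2": ["b1"],
}
levels = level_partition(preds)
for k in sorted(levels):
    print(k, levels[k])
```

The recursion is adequate for small DAGs; a production scheduler would more likely use an iterative topological traversal.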

The scheduler considers the subtasks one level at a time.
Among  the  subtasks  in  a  particular  level  i,  the  subtask  with 
the  minimum  completion  time  will  be  scheduled  first  in  the 
Min-FINISH  algorithm  and  the  subtask  with  the  maximum 
completion  time  is  scheduled  first  in  the  Max-FINISH  algo¬ 
rithm.  The  advance  reservations  of  compute  resources  and 
data  repositories  are  handled  by  choosing  the  first-fit  time 
interval  to  optimize  the  completion  time  of  a  subtask. 

The  idea  behind  the  Min-FINISH  algorithm,  as  in  algo¬ 
rithm  D  in  [14]  and  Min-min  algorithm  in  SmartNet  [11],  is 
that  at  each  step,  we  attempt  to  minimize  the  finish  time  of 
the  last  subtask  in  the  ready  set.  On  the  other  hand,  the  idea 
in  the  Max-FINISH,  as  in  algorithm  E  in  [14]  and  Max-min 
algorithm  in  SmartNet  [11],  is  to  minimize  the  worst  case 
finishing  time  for  critical  subtasks  by  giving  them  the  op¬ 
portunity  to  be  mapped  to  their  best  resources.  The  pseudo 
code  for  the  level-by-level  scheduling  algorithms  is  shown 
in  Figure  8. 
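
To make the selection rule and the first-fit handling of advance reservations concrete, the following Python sketch schedules a single level of independent subtasks. It is a simplified illustration, not the simulator used in this paper. The cost model is an assumption: a subtask first fetches its data set from one repository and then executes on one machine, every subtask is ready at time 0, and the names `first_fit`, `finish`, `schedule_level`, and `demo` are hypothetical. Reservations are kept as sorted lists of busy intervals per resource.

```python
import bisect

def first_fit(busy, earliest, duration):
    """Earliest start >= earliest whose interval avoids every (start, end) in busy."""
    start = earliest
    for b_start, b_end in busy:          # busy is sorted and non-overlapping
        if start + duration <= b_start:
            break                        # fits in the gap before this reservation
        start = max(start, b_end)        # otherwise try right after it
    return start

def finish(v, m, s, exec_time, fetch_time, mach_busy, repo_busy):
    """Return (fetch start, execution start, completion time) for FINISH(v, m, s)."""
    f_start = first_fit(repo_busy[s], 0, fetch_time[v][s])
    e_start = first_fit(mach_busy[m], f_start + fetch_time[v][s], exec_time[v][m])
    return f_start, e_start, e_start + exec_time[v][m]

def schedule_level(ready, exec_time, fetch_time, mach_busy, repo_busy, pick=min):
    """Schedule one level; pick=min is Min-FINISH, pick=max is Max-FINISH."""
    ready, plan = set(ready), {}
    while ready:
        best = {}
        for v in ready:                  # best (completion, machine, repository) per subtask
            best[v] = min(
                (finish(v, m, s, exec_time, fetch_time, mach_busy, repo_busy)[2], m, s)
                for m in exec_time[v] for s in fetch_time[v])
        vk = pick(sorted(ready), key=lambda v: best[v])
        t, m, s = best[vk]
        f_start, e_start, _ = finish(vk, m, s, exec_time, fetch_time, mach_busy, repo_busy)
        # Record the new reservations on the chosen repository and machine.
        bisect.insort(repo_busy[s], (f_start, f_start + fetch_time[vk][s]))
        bisect.insort(mach_busy[m], (e_start, t))
        plan[vk] = (m, s, t)
        ready.remove(vk)
    return plan

def demo(pick):
    exec_time = {"v1": {"m1": 4, "m2": 6}, "v2": {"m1": 3, "m2": 5}}
    fetch_time = {"v1": {"s1": 2}, "v2": {"s1": 1}}
    mach_busy = {"m1": [(0, 2)], "m2": []}   # m1 is reserved by another user during [0, 2)
    repo_busy = {"s1": []}
    return schedule_level(["v1", "v2"], exec_time, fetch_time, mach_busy, repo_busy, pick)

print(demo(min))   # Min-FINISH
print(demo(max))   # Max-FINISH
```

In this toy instance the last subtask completes at time 9 under Min-FINISH and at time 8 under Max-FINISH. This only shows that the two orderings can produce different schedules; it says nothing about which heuristic is better in general.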

3.2.  Greedy  Approach 

Since  the  subtasks  in  a  specific  level  i  of  the  combined 
DAG  belong  to  different  independent  tasks,  by  scheduling 
level  by  level  we  are  creating  dependency  among  various 
tasks.  Further,  the  completion  times  of  levels  of  different 
tasks  can  vary  widely,  and  the  level-by-level  scheduling  al¬ 
gorithms  may  not  perform  well.  The  idea  in  the  greedy 
heuristics,  Min-FINISH-ALL  and  Max-FINISH-ALL,  is  to 
consider subtasks in all the levels that are ready to execute
in determining their schedule. This will advance execution
of different tasks by different amounts and will attempt
to  achieve  the  global  objective  and  provide  good  response 
times  for  short  tasks  at  the  same  time.  As  before,  we  con¬ 
sider  both  the  minimum  finish  time  and  the  maximum  fin¬ 
ish  time  of  all  ready  subtasks  in  determining  the  order  of  the 
subtasks  to  schedule. 

The two greedy algorithms, Min-FINISH-ALL and Max-FINISH-ALL,
are similar to Min-FINISH and Max-FINISH, respectively.
They differ only with respect to
the Ready set. In the greedy algorithms, the Ready set may
contain subtasks from several levels. Initially, the Ready set
contains all subtasks at level 1 from all applications. After
mapping a subtask, the algorithms check whether any of its
successors are ready to be considered for scheduling and add them
to the Ready set. A subtask cannot be considered for scheduling
until all its predecessors have been scheduled.
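
The maintenance of the Ready set in the greedy algorithms can be sketched as follows. This Python example is illustrative only and deliberately simplified: it models machines alone (no data repositories, no advance reservations, no communication costs), and the function name `greedy_schedule` and the example data are assumptions rather than part of the algorithms evaluated in this paper.

```python
def greedy_schedule(preds, exec_time, pick=min):
    """Min-FINISH-ALL (pick=min) or Max-FINISH-ALL (pick=max), machines only.

    preds:     subtask -> list of predecessor subtasks
    exec_time: subtask -> {machine: execution time}
    Returns the scheduling order as (subtask, machine, finish time) triples.
    """
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)
    machines = sorted({m for times in exec_time.values() for m in times})
    free_at = {m: 0 for m in machines}   # time at which each machine becomes idle
    done = {}                            # subtask -> finish time
    order = []

    def best(v):
        # Earliest completion of v over all machines, as (finish time, machine).
        release = max((done[p] for p in preds[v]), default=0)
        return min((max(free_at[m], release) + exec_time[v][m], m) for m in machines)

    # Initially Ready holds the subtasks without predecessors (level 1).
    ready = {v for v, ps in preds.items() if not ps}
    while ready:
        vk = pick(sorted(ready), key=best)
        t, m = best(vk)
        free_at[m], done[vk] = t, t
        order.append((vk, m, t))
        ready.remove(vk)
        # A successor joins Ready once all of its predecessors are scheduled.
        for s in succs[vk]:
            if all(p in done for p in preds[s]):
                ready.add(s)
    return order

preds = {"a1": [], "a2": ["a1"], "a3": ["a1"], "a4": ["a2", "a3"],
         "b1": [], "b2": ["b1"]}
cost = {"a1": 2, "a2": 3, "a3": 1, "a4": 2, "b1": 1, "b2": 1}
exec_time = {v: {"m1": c, "m2": c} for v, c in cost.items()}
order = greedy_schedule(preds, exec_time, pick=min)
for entry in order:
    print(entry)
```

With these assumed costs, the short task (b1, b2) is completed by time 2 while the longer task is still in progress, which is the kind of behavior the greedy variants aim for.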

4.  Results  and  Discussion 

For  the  generalized  resource  scheduling  problem  consid¬ 
ered  above,  it  is  not  clear  which  variation  of  the  list  schedul¬ 
ing  will  perform  best.  Our  intuition  is  that  scheduling  sub¬ 
tasks  by  considering  all  resource  types  together  will  result  in 
bounded  suboptimal  solutions.  In  order  to  evaluate  the  ef¬ 
fectiveness  of  the  scheduling  algorithms  discussed  in  Sec¬ 
tions  3.1  and  3.2,  we  have  developed  a  software  simulator 
that calculates the completion time for each of them. The input
parameters are given to the simulator as fixed values or
as a range of values with a minimum and maximum value.
Subtask execution times, communication latencies, communication
transfer rates, data item amounts, and data set
amounts are specified to the simulator as ranges of values.
The actual values of these parameters are chosen randomly
by the simulator within the specified ranges. The fixed input
parameters are the number of machines, the number of data
repositories, the number of data items, and the total number
of subtasks.

Figure 9. Simulation results for 20 machines and 6 data repositories with varying number of subtasks

Figure 10. Simulation results for 50 subtasks with varying number of machines and data repositories

We assume that each task needs an input data set from the
data repositories. This data set can be replicated and may
be retrieved from one or more data repositories. Each compute
resource and data repository has several slots blocked
at the beginning of the simulation to indicate advance
reservations. We compare our scheduling algorithms with the
separated version of the baseline algorithm discussed in
Section 2.4. The simulation results are shown in Figures 9
and 10. In Figure 9, the scheduling algorithms are compared
for a varying number of subtasks using 20 machines and
6 data repositories. Figure 10 shows a similar comparison
for a varying number of machines and data repositories with
50 subtasks. Our preliminary results show that all four of
our heuristic algorithms appear to have similar performance
when task costs are relatively uniform. The simulation results
clearly show that it is advantageous to schedule the system
resources in a unified manner rather than scheduling each
type of resource separately. Our scheduling algorithms achieve
at least a 30% improvement over the baseline algorithm, which
uses the separated approach.

5.  Future  Work 

This work represents, to the best of our knowledge, the
first step towards a unified framework for resource scheduling
with emerging constraints that are important in metacomputing.
In this paper, we have considered one such requirement:
advance reservations for compute resources
and data repositories. We are investigating the
question of how advance reservations impact task completion
times, that is, how early the scheduler should
reserve a resource for a subtask to avoid waiting for another
resource and/or blocking a different subtask. We are
currently  working  on  extending  our  scheduling  algorithms 
to  consider  QoS  requirements  such  as  deadlines,  priorities, 
and  security.  We  are  investigating  the  mapping  of  QoS  spec¬ 
ified  at  task  level  to  subtasks  in  our  framework. 

In  our  future  work  we  plan  to  develop  scheduling  algo¬ 
rithms  for  dynamic  environments  with  the  above  mentioned 
resource  requirements.  In  a  dynamic  environment,  appli¬ 
cation  tasks  arrive  in  a  real-time  non-deterministic  man¬ 
ner.  System  resources  may  be  removed,  or  new  resources 
may be added during run-time. Dynamic scheduling algorithms
make use of real-time information and require feedback
from the system.

Abstract

The  MILAN  project,  a  joint  effort  involving  Arizona 
State  University  and  New  York  University,  has  produced 
and  validated  fundamental  techniques  for  the  realization 
of  efficient,  reliable,  predictable  virtual  machines,  that 
is,  metacomputers,  on  top  of  environments  that  consist  of 
an  unreliable  and  dynamically  changing  set  of  machines. 
In addition to the techniques, the principal outcomes of
the project include three parallel programming systems
(Calypso, Chime, and Charlotte), which enable applications
developed for ideal, shared memory, parallel machines
to execute on distributed platforms that are subject to failures,
slowdowns, and changing resource availability. The
lessons  learned  from  the  MILAN  project  are  being  used  to 
design  Computing  Communities,  a  metacomputing  frame¬ 
work  for  general  computations. 

1.  Motivation 

MILAN  (Metacomputing  In  Large  Asynchronous 
Networks)  is  a  joint  project  of  Arizona  State  University  and 
New  York  University.  The  primary  objective  of  the  MILAN 
project  is  to  provide  middleware  layers  that  would  enable 
the efficient, predictable execution of applications on an
unreliable and dynamically changing set of machines. Such a
middleware layer will in effect create a metacomputer, that
is, a reliable, stable platform for the execution of
applications.

Improvements  in  networking  hardware,  communication 
software,  distributed  shared  memory  techniques,  program¬ 
ming  languages  and  their  implementations  have  made  it  fea¬ 
sible  to  employ  distributed  collections  of  computers  for  ex¬ 
ecuting  a  wide  range  of  parallel  applications.  These  “meta¬ 
computing  environments,”  built  from  commodity  machine 
nodes  and  connected  using  commodity  interconnects,  afford 
significant  cost  advantages  in  addition  to  their  widespread 
availability (e.g., a machine on every desktop in an
organization). However, such environments also present unique
challenges for constructing metacomputers on them, because
the component machines and networks may (1) exhibit
wide variations in performance and capacity, and (2) become
unavailable either partially or completely because
of their use for other (non-metacomputing related) tasks.
These challenges force parallel applications running on
metacomputers to deal with an unreliable, dynamically
changing set of machines and have thus limited their use
on all but the most decoupled of parallel computations.

As  part  of  the  MILAN  project,  we  have  been  investigat¬ 
ing  fundamental  techniques  which  would  enable  the  effec¬ 
tive  use  of  metacomputing  environments  for  a  wide  class  of 
applications,  originally  concentrating  on  parallel  ones.  The 
key  thrust  of  the  project  has  been  to  develop  run-time  mid¬ 
dleware  that  builds  an  efficient,  predictable,  reliable  virtual 
machine  model  on  top  of  unreliable  and  dynamically  chang¬ 
ing  platforms.  Such  a  virtual  machine  model  would  enable 
applications  developed  for  idealized,  reliable,  homogeneous 
parallel  machines  to  run  unchanged  on  unreliable,  hetero¬ 
geneous  metacomputing  environments.  Figure  1  shows  the 
MILAN  middleware  in  context.  Our  approach  for  realizing 
the  virtual  machine  takes  advantage  of  two  general  charac¬ 
teristics  of  computation  behavior:  adaptivity  and  tunability. 

Adaptivity  refers  to  a  flexibility  in  execution.  Specif¬ 
ically,  a  computation  is  adaptive  if  it  exhibits  at  least 
one  of  these  two  properties:  (1)  it  can  statically  (at 
start  time)  and/or  dynamically  (during  the  execution) 
ask  for  resources  satisfying  certain  characteristics  and 
incorporate  such  resources  when  they  are  given  to  it, 
and  (2)  it  can  continue  executing  even  when  some  re¬ 
sources  are  taken  away  from  it. 

Tunability refers to a flexibility in specification.
Specifically,  a  computation  is  tunable  if  it  is  able  to 
trade  off  resource  requirements  over  its  lifetime,  com¬ 
pensating  for  a  smaller  allocation  of  resources  in  one 
stage  with  a  larger  allocation  in  another  stage  and/or  a 
change  in  the  quality  of  output  produced  by  the  com¬ 
putation. 

Our techniques leverage this flexibility in execution and
specification to provide reliability, load balancing, and
predictability, even when the underlying set of machines is
unreliable and changing dynamically.

Figure 1. The MILAN middleware in context.

The  principal  outcomes  of  the  MILAN  project  are  (1) 
a core set of fundamental resource management techniques
[21, 3, 23, 9] enabling construction of efficient, reliable,
predictable virtual machines, and (2) the realization of
these  techniques  in  three  complete  programming  systems: 
Calypso  [3],  Chime  [27],  and  Charlotte  [6].  Calypso  ex¬ 
tends  C++  with  parallel  steps  interleaved  into  a  sequential 
program.  Each  parallel  step  specifies  the  independent  exe¬ 
cution  of  multiple  concurrent  tasks  or  a  family  of  such  tasks. 
Chime  extends  Calypso  to  provide  nested  parallel  steps  and 
inter-thread communication primitives (as expressed in the
shared memory parallel language Compositional C++ [8]).
Charlotte  provides  a  Calypso-like  programming  system  and 
runtime  environment  for  the  Web.  In  addition  to  these  sys¬ 
tems,  the  MILAN  project  has  also  produced  two  general 
tools: ResourceBroker [4] and KnittingFactory [5], which
support  resource  discovery  and  integration  in  distributed 
and  web-based  environments,  respectively.  More  recently, 
as  part  of  the  Computing  Communities  project,  we  have 
been  examining  how  the  experience  gained  from  design¬ 
ing,  implementing,  and  evaluating  these  systems  can  be  ex¬ 
tended  to  supporting  general  applications  on  metacomput¬ 
ing  environments. 

The  rest  of  the  paper  is  organized  as  follows.  Section  2 
overviews  the  fundamental  techniques  central  to  all  of  the 
MILAN project's programming systems. The design,
implementation, and performance of the various programming
systems and general tools are described in detail in Section 3.
Finally,  Section  4  presents  the  rationale  and  preliminary  de¬ 
sign  of  Computing  Communities,  a  metacomputing  frame¬ 
work  for  general  computations. 

2. Key Techniques

To  execute  parallel  programs  on  networks  of  commodity 
machines,  one  frequently  assumes  a  priori  knowledge — at 
program  development — of  the  number,  relative  speeds,  and 
the  reliability  of  the  machines  involved  in  the  computation. 
Using this information, the program can then distribute its
load evenly for efficient execution. This knowledge cannot
be assumed for distributed multiuser environments, and
hence,  it  is  imperative  that  programs  adapt  to  machine  avail¬ 
ability.  That  is,  a  program  running  on  a  metacomputer  must 
be  able  to  integrate  new  machines  into  a  running  compu¬ 
tation,  mask  and  remove  failed  machines,  and  balance  the 
work  load  in  such  a  way  that  slow  machines  do  not  dictate 
the  progress  of  the  computation. 

The traditional solution for coping with this type of
dynamically changing environment has been to write
self-scheduling parallel programs (also referred to as the
master/slave [16], the manager/worker [17], or the
bag-of-tasks [7] programming model). In self-scheduled programs,
the  computation  is  divided  into  a  large  number  of  small 
computational  units,  or  tasks.  Participating  machines  pick 
up  and  execute  a  task,  one  at  a  time,  until  all  tasks  are  done, 
enabling  the  computation  to  progress  at  a  rate  proportional 
to  available  resources.  However,  self  scheduling  does  not 
solve  all  the  problems  associated  with  executing  programs 
on  distributed  multiuser  environments.  First,  self  schedul¬ 
ing  does  not  address  machine  and  network  failures.  Second, 
a  very  slow  machine  can  slow  down  the  progress  of  faster 
machines  if  it  picks  up  a  compute-intensive  task.  Finally, 
self  scheduling  increases  the  number  of  tasks  comprising  a 
computation  and,  thereby,  increases  the  effects  of  the  over¬ 
head  associated  with  assigning  tasks  to  machines.  Depend¬ 
ing  on  the  network,  this  overhead  may  be  large  and,  in  many 
cases,  unpredictable. 

The  MILAN  project  extends  the  basic  self-scheduling 
scheme  in  various  ways  to  adequately  address  the  above 
shortcomings.  These  extensions  are  embodied  in  five  tech¬ 
niques:  eager  scheduling,  two-phase  idempotent  execution 
strategy  (TIES),  dynamic  granularity  management,  pre¬ 
emptive  scheduling,  and  predictable  scheduling  for  tunable 
computations.  We  describe  the  principal  ideas  behind  each 
of  these  techniques  here,  deferring  a  detailed  discussion  of 
their implementation and impact on performance to the
description of various programming systems in Section 3.

2.1 Eager Scheduling

Eager  scheduling  extends  self  scheduling  to  deal  with 
network  and  machine  failures,  as  well  as  any  disparity  in 
machine  speeds.  The  key  idea  behind  eager  scheduling,  ini¬ 
tially  proposed  in  [21],  is  that  a  single  computation  task  can 
be  concurrently  worked  upon  by  multiple  machines.  Ea¬ 
ger  scheduling  works  in  a  manner  similar  to  self  schedul¬ 
ing  at  the  beginning  of  a  parallel  step,  but  once  the  num¬ 
ber  of  remaining  tasks  goes  below  the  number  of  avail¬ 
able  machines,  eager  scheduling  aggressively  assigns  and 
re-assigns  tasks  until  all  tasks  have  been  executed  to  com¬ 
pletion.  Concurrent  assignment  of  tasks  to  multiple  ma¬ 
chines  guarantees  that  slow  machines,  even  very  slow  ma¬ 
chines,  do  not  slow  down  the  computation.  Furthermore,  by 
considering  failure  as  a  special  case  of  a  slow  machine  (an 
infinitely  slow  machine),  even  if  machines  crash  or  become 
less  accessible,  for  example  due  to  network  delays,  the  en¬ 
tire  computation  will  finish  as  long  as  at  least  one  machine 
is  available  for  a  sufficiently  long  period  of  time.  Thus,  ea¬ 
ger  scheduling  masks  machine  failures  without  the  need  to 
actually  detect  failures. 
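
The behavior of eager scheduling can be illustrated with a small event-driven simulation. The Python sketch below is an assumption-laden illustration, not the MILAN implementation: the function `eager_schedule`, the rule of re-assigning the task with the fewest copies in flight, and the modeling of a crashed machine as a worker with speed 0 are all hypothetical choices made for the example.

```python
import heapq

def eager_schedule(task_work, speeds):
    """Simulate eager scheduling of independent tasks.

    task_work[i]: work units of task i.
    speeds[w]:    work units per time unit for worker w; 0 models a machine
                  that has crashed and therefore never reports a result.
    Returns (makespan, {task: worker whose copy finished first}).
    """
    remaining = set(range(len(task_work)))
    copies = {t: 0 for t in remaining}   # number of workers holding each task
    completed_by = {}
    events = []                          # heap of (finish time, worker, task)

    def assign(worker, now):
        if not remaining:
            return
        # Unassigned tasks first; afterwards re-assign unfinished tasks that
        # other workers are still holding (fewest copies in flight first).
        t = min(sorted(remaining), key=lambda task: copies[task])
        copies[t] += 1
        if speeds[worker] > 0:           # a crashed worker never finishes
            heapq.heappush(events, (now + task_work[t] / speeds[worker], worker, t))

    for w in range(len(speeds)):
        assign(w, 0.0)
    now = 0.0
    while remaining and events:
        now, w, t = heapq.heappop(events)
        if t in remaining:               # the first finished copy wins
            remaining.discard(t)
            completed_by[t] = w
        assign(w, now)
    return now, completed_by

# Worker 1 has crashed; worker 2 is twice as fast as worker 0.
makespan, completed_by = eager_schedule([4, 4, 4], [1.0, 0.0, 2.0])
print(makespan, completed_by)
```

In this run the task picked up by the crashed worker is later re-assigned to and completed by a live worker, so the computation finishes without any explicit failure detection.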

2.2  Two-phase  Idempotent  Execution  Strategy 
(TIES) 

Multiple executions of a program fragment (which are
possible under eager scheduling) can result in an incorrect
program state. TIES [21] ensures idempotent memory semantics
in the presence of multiple executions. The computation
of each parallel step is divided into two phases.
In the first phase, modifications to the shared data region,
that is, the write-set of tasks, are computed but kept aside
in a buffer. The second phase begins when all tasks have
executed to completion. Then, a single write-set for each
completed task is applied to the shared data, thus atomically
updating the memory. Note that each phase is idempotent,
since its inputs and outputs are disjoint. Informally,
in  the  first  phase  the  input  is  shared  data  and  the  output  is 
the  buffer,  and  in  the  second  phase  the  input  is  the  buffer 
and  the  output  is  shared  memory. 
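
The two phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the TIES implementation: shared memory is modeled as a dictionary, each task is a function that returns its write-set, and the names `run_parallel_step`, `inc_x`, and `sum_xy` are hypothetical. The list `executions` records completed executions, including a duplicate of the kind eager scheduling can produce.

```python
from types import MappingProxyType

def run_parallel_step(shared, tasks, executions):
    """Run one parallel step with two-phase idempotent execution.

    shared:     dict standing in for the shared data region
    tasks:      task id -> function(read-only view of shared) -> write-set dict
    executions: task ids in completion order; a task may appear several times
    """
    view = MappingProxyType(shared)      # read-only view: phase 1 cannot write shared data
    buffers = {}
    # Phase 1: inputs come from shared data, outputs go only to the buffer.
    for tid in executions:
        write_set = tasks[tid](view)
        buffers.setdefault(tid, write_set)   # keep a single write-set per task
    if set(buffers) != set(tasks):
        raise RuntimeError("phase 2 starts only after every task has completed")
    # Phase 2: inputs come from the buffer, outputs go to shared data.
    for write_set in buffers.values():
        shared.update(write_set)
    return shared

shared = {"x": 1, "y": 10}
tasks = {
    "inc_x": lambda d: {"x": d["x"] + 1},
    "sum_xy": lambda d: {"z": d["x"] + d["y"]},
}
# "inc_x" is executed twice, as can happen under eager scheduling.
result = run_parallel_step(shared, tasks, ["inc_x", "sum_xy", "inc_x"])
print(result)
```

Although `inc_x` runs twice, `x` is incremented once, and `sum_xy` sees the value of `x` from the start of the step regardless of execution order.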

2.3  Dynamic  Granularity  Management 

The  interplay  of  eager  scheduling  and  TIES  addresses 
fault  masking  and  load  balancing.  Dynamic  granularity 
management (bunching for short) is used to amortize overheads
and mask network latencies associated with the process
of assigning tasks to machines. Bunching extends self
scheduling  by  assigning  a  set  of  tasks  (a  bunch)  as  “a  single 
assignment.”  Bunching  has  three  benefits.  First,  it  reduces 
the  number  of  task  assignments,  and  hence,  the  associated 
overhead.  Second,  it  overlaps  computation  with  commu¬ 
nication  by  allowing  machines  to  execute  the  next  task  (of 
a bunch) while the results of the previous task are being
sent back on the network. Finally, bunching allows the
programmer to write fine-grained parallel programs that are
automatically and transparently executed in a coarse-grained
manner.

We  have  implemented [19],  an  algorithm  that 
computes  the  bunch  size  based  on  the  number  of  remaining 
tasks  and  the  number  of  currently  available  machines. 
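
As a purely hypothetical illustration of how a bunch size might depend on these two quantities, the Python sketch below hands out a fixed fraction of the remaining tasks per machine, so that bunches shrink as the parallel step nears completion. The formula, the `factor` parameter, and the function names are assumptions made for this example; they are not claimed to be the algorithm of [19].

```python
import math

def bunch_size(remaining_tasks, available_machines, factor=2):
    """Hypothetical rule: remaining / (factor * machines), at least one task."""
    if remaining_tasks <= 0:
        return 0
    machines = max(1, available_machines)
    return max(1, math.ceil(remaining_tasks / (factor * machines)))

def bunch_sequence(total_tasks, machines):
    """Sizes of successive bunches handed out until no tasks remain."""
    sizes, remaining = [], total_tasks
    while remaining > 0:
        b = min(bunch_size(remaining, machines), remaining)
        sizes.append(b)
        remaining -= b
    return sizes

sizes = bunch_sequence(20, 2)
print(sizes)
```

Large early bunches amortize assignment overhead, while small late bunches limit the amount of work that a slow machine can hold back at the end of a step.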

2.4  Preemptive  Scheduling 

Eager  scheduling  provides  load  balancing  and  fault  iso¬ 
lation  in  a  dynamic  environment.  However,  our  description 
so  far  has  considered  only  non-preemptive  tasks  which  run 
to  completion  once  assigned  to  a  worker.  Non-preemptive 
scheduling  has  the  disadvantage  of  delivering  sub-optimal 
performance  when  there  is  a  mismatch  between  the  set  of 
tasks and the set of machines. Such mismatches arise
when the number of tasks is not divisible by the number
of machines, when the tasks are of unequal lengths, and
when the number of tasks is not static (i.e., tasks are
created and/or terminated on the fly). To address
inefficiencies resulting from these situations, the MILAN project
complements  eager  scheduling  with  preemptive  scheduling 
techniques.  Our  results,  discussed  in  Section  3,  show  that 
despite  preemption  overheads,  use  of  preemptive  schedul¬ 
ing  on  distributed  platforms  can  improve  execution  time  of 
parallel  programs  by  reducing  the  number  of  tasks  that  need 
to  be  repeatedly  executed  by  eager  scheduling  [23]. 

We  have  developed  a  family  of  preemptive  algorithms, 
of  which  we  present  three  here.  The  Optimal  Algorithm 
is  targeted  for  situations  where  the  number  of  tasks  to  be 
executed  is  slightly  larger  than  the  number  of  machines 
available.  This  algorithm  precomputes  a  schedule  that 
minimizes the execution time and the number of context
switches needed. However, it requires that the task execution
time be known in advance and therefore is not always
practical.  The  Distributed,  Fault-tolerant  Round  Robin  Al¬ 
gorithm  is  suited  for  a  set  of  n  tasks  scheduled  on  m  ma¬ 
chines,  where  n  >  m.  Initially,  the  first  m  tasks  are  as¬ 
signed  to  the  m  machines.  Then,  after  a  specified  time 
quantum,  all  the  tasks  are  preempted  and  the  next  m  tasks 
are  assigned.  This  continues  in  a  circular  fashion  until  all 
tasks  are  completed.  The  Preemptive  Task  Bunching  Algo¬ 
rithm  is  applicable  over  a  wider  range  of  circumstances.  All 
n  tasks  are  bunched  into  m  bunches  and  assigned  to  the 
m  machines.  When  a  machine  finishes  its  assigned  bunch, 
all  the  tasks  on  all  the  other  machines  are  pre-empted  and 
all  the  remaining  tasks  are  collected,  re-bunched  (into  m 
sets),  and  assigned  again.  This  algorithm  works  well  for 
both  large-grained  and  fine-grained  tasks  even  when  ma¬ 
chine  speeds  and  task  lengths  vary. 
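
The round-robin rotation described above can be sketched as a small simulation. The Python code below is an illustration only: it assumes unit-speed machines and ignores preemption overhead, and it does not model the distributed or fault-tolerant aspects of the algorithm. The function name `round_robin` and the example workload are hypothetical.

```python
from collections import deque

def round_robin(work, m, quantum):
    """Preemptive round robin of n tasks on m unit-speed machines.

    work[i] is the execution time of task i. In every round the next m
    unfinished tasks run for at most one quantum and are then preempted.
    Returns {task: finish time}.
    """
    queue = deque(range(len(work)))
    left = list(work)
    finish = {}
    now = 0
    while queue:
        running = [queue.popleft() for _ in range(min(m, len(queue)))]
        # The round ends early if every running task finishes within the quantum.
        step = min(quantum, max(left[t] for t in running))
        for t in running:
            ran = min(step, left[t])
            left[t] -= ran
            if left[t] == 0:
                finish[t] = now + ran
            else:
                queue.append(t)          # preempted: back to the end of the queue
        now += step
    return finish

finish_times = round_robin([3, 1, 2], m=2, quantum=1)
print(finish_times)
```

Integer task lengths are used in the example so that the test for a finished task is exact.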

2.5 Predictable Scheduling

While  the  techniques  described  earlier  enable  the  build¬ 
ing  of  an  efficient,  fault-tolerant  virtual  machine  on  top  of 
an  unreliable  and  dynamically  changing  set  of  machines, 
they  alone  are  unable  to  address  the  predictability  require¬ 
ments  of  applications  such  as  image  recognition,  virtual  re¬ 
ality,  and  media  processing  that  are  increasingly  running  on 
metacomputers.  One  of  the  key  challenges  deals  with  pro¬ 
viding  sufficient  resources  to  computations  to  enable  them 
to  meet  their  time  deadlines  in  the  face  of  changing  resource 
availability. 

Our  technique  [9]  relies  upon  an  explicit  specification  of 
application  tunability,  which  refers  to  an  application’s  abil¬ 
ity  to  absorb  and  relinquish  resources  during  its  lifetime, 
possibly  trading  off  resource  requirements  versus  quality 
of  its  output.  Tunability  provides  the  freedom  of  choosing 
amongst  multiple  execution  paths,  each  with  their  own  re¬ 
source  allocation  profile.  Given  such  a  specification  and 
short-term  knowledge  about  the  availability  of  resources, 
the  MILAN  resource  manager  chooses  an  appropriate  ex¬ 
ecution  path  in  the  computation  that  would  allow  the  com¬ 
putation  to  meet  its  predictability  requirements.  In  general, 
the  resource  manager  will  need  to  renegotiate  both  the  level 
of  resource  allocation  and  the  choice  of  execution  path  in 
response to changes in resource characteristics. Thus, an
application's tunability increases its likelihood of achieving
predictable behavior in a dynamic environment.

3.  Programming  Systems 

We  describe,  in  turn,  three  programming  systems — 
Calypso,  Chime,  and  Charlotte — which  provide  the  con¬ 
crete  context  in  which  the  techniques  described  earlier  have 
been  implemented  and  evaluated.  We  then  discuss  the  de¬ 
sign  and  implementation  of  the  ResourceBroker,  a  tool  for 
dynamic  resource  association.  KnittingFactory  provides  the 
same functionality as ResourceBroker, albeit for Web
metacomputing environments, and is not discussed here because
of space considerations.

3.1  Calypso 

Commercial  realities  dictate  that  parallel  computations 
typically  will  not  be  given  a  dedicated  set  of  identical  ma¬ 
chines.  Non-dedicated  computing  platforms  suffer  from 
non-uniform  processing  speeds,  unpredictable  behavior, 
and  transient  availability.  These  characteristics  result  from 
external  factors  that  exist  in  “real”  networks  of  machines. 
Unfortunately,  load  balancing,  fault  masking,  and  adaptive 
execution  of  programs  on  a  set  of  dynamically  changing 
machines  are  neglected  by  most  programming  systems.  The 
neglect  of  these  issues  has  complicated  the  already  difficult 
job  of  developing  parallel  programs. 


Calypso  [3]  is  a  parallel  programming  system  and  a  run¬ 
time  system  designed  for  adaptive  parallel  computing  on 
networks  of  machines.  The  work  on  Calypso  has  resulted 
in  several  original  contributions  which  are  summarized  be¬ 
low. 

Calypso separates the programming model from the execution
environment: programs are written for a reliable virtual
shared-memory computer with an unbounded number of
processors, i.e., a metacomputer, but execute on a network
of dynamically changing machines. This presents the
programmer with the illusion of a reliable machine for program
development  and  verification.  Furthermore,  the  separation 
allows  programs  to  be  parallelized  based  on  the  inherent 
properties  of  the  problem  they  solve,  rather  than  the  execu¬ 
tion  environment. 

Programs can execute, without modification, on a single
machine, a multiprocessor, or a network of unreliable
machines. The Calypso runtime system adapts executing
programs to the available resources: computations
dynamically scale up or down as machines become available
or unavailable. It uses TIES and allows parts of
a  computation  executing  on  remote  machines  to  fail,  and 
possibly  recover,  at  any  point  without  affecting  the  correct¬ 
ness  of  the  computation.  Unlike  other  fault-tolerant  sys¬ 
tems,  there  is  no  significant  additional  overhead  associated 
with  this  feature. 

Calypso  automatically  distributes  the  work-load  depend¬ 
ing  on  the  dynamics  of  participating  machines,  using  ea¬ 
ger  scheduling  and  bunching.  The  result  is  that  fine-grain 
computations  are  efficiently  executed  in  coarse-grain  fash¬ 
ion,  and  faster  machines  perform  more  of  the  computation 
than  slower  machines.  Not  only  is  there  no  cost  associated 
with  this  feature,  but  it  actually  speeds  up  the  computation, 
because  fast  machines  are  never  blocked  while  waiting  for 
slower  machines  to  finish  their  work  assignments — they  by¬ 
pass  the  slower  machines.  As  a  consequence,  the  use  of 
slow  machines  will  never  be  detrimental  to  the  performance 
of  a  parallel  program. 

3.1.1  Calypso  Programs 

A Calypso program consists of standard C++, augmented by
four additional keywords to express parallelism.
Parallelism is obtained by embedding parallel steps within
sequential programs. Parallel steps consist of one or more
tasks (referred to as jobs in the Calypso context), which
(logically) execute in parallel
and  are  generally  responsible  for  computationally  intensive 
segments  of  the  program.  The  sequential  parts  of  programs 
are  referred  to  as  sequential  steps  and  they  generally  per¬ 
form  initialization,  input/output,  user  interactions,  etc. 

Figure 2 illustrates the execution of a program with two
parallel steps and three sequential steps. It is important to
note that parallel programs are written for a virtual
shared-memory parallel machine irrespective of the number of
machines that participate in a given execution.

Figure 2. An execution of a program with two parallel steps
and three sequential steps; the first parallel step consists
of 9 jobs, the second parallel step consists of 6 jobs.

This  programming  model  is  sometimes  referred  to  as 
a  block-structured  parbegin/parend  or  fork/join  model  [13, 
25]. Unlike other programming models, where programs are
decomposed (into several files or functions) for parallel
execution, this model, together with shared-memory
semantics, allows loop-level parallelization. As a result,
given a
working  sequential  program  it  is  fairly  straightforward  to 
parallelize  individual  independent  loops  in  an  incremental 
fashion — if  the  semantics  allows  this. 

Shared-memory  semantics  is  only  provided  for  shared 
variables,  i.e.,  variables  that  are  tagged  with  the  shared 
keyword.  A  parallel  step  starts  with  the  keyword 
parbegin  and  ends  with  the  keyword  parend.  Within 
a  parallel  step,  multiple  parallel  jobs  may  be  defined  using 
the  keyword  routine.  Completion  of  a  parallel  step  con¬ 
sists  of  completion  of  all  its  jobs  in  an  indeterminate  order. 

3.1.2 Execution Overview

A typical execution of a Calypso program consists of a
central process, called the manager, and one or more
additional processes, called workers. These processes can
reside on a single machine or they can be distributed on a
network. In particular, when a user starts a Calypso
program, she is in reality starting a manager. The manager
immediately forks a child process that executes as a
worker.

The manager is responsible for the management of the
computation as well as the execution of sequential steps.
The current Calypso implementation allows only one manager
and therefore does not tolerate the failure of this
process. The computation of parallel jobs is left to the
workers. In general, the number of workers and the resources
they can devote to parallel computations can change
dynamically in a completely arbitrary manner, and the
program adapts to the available machines. In fact, the
arbitrary slowdown of workers due to other programs
executing on the same machine, failures due to process and
machine crashes, and network inaccessibility due to network
partitions are all tolerated. Furthermore, workers can be
added at any time to speed up an already executing system
and to increase fault tolerance. Arbitrary slowdown of the
manager is also tolerated, although this would, of course,
slow down the overall execution.

3.1.3 Manager Process

The manager is responsible for executing the non-parallel
steps of a computation as well as providing workers with
scheduling and memory services.

Scheduling Service: Jobs are assigned to workers based on a
self-scheduling policy. Moreover, the manager has the option
of assigning a job repeatedly until it is executed to
completion by at least one worker. This is eager scheduling,
and it provides the following benefits:

•  As long as at least one worker does not fail continually,
all jobs will be completed, if necessary, by this one
worker.

•  Jobs assigned to workers that later fail are
automatically reassigned to other workers; thus crash and
network failures are tolerated.

•  Because workers on fast machines can re-execute jobs
that were assigned to slow machines, they can bypass a slow
worker to avoid delaying the progress of the program.

In addition to eager scheduling, Calypso’s scheduling
service implements several other scheduling techniques for
improved performance. Bunching masks network latencies
associated with the process of assigning jobs to workers. It
is implemented by sending the worker a range of job IDs in
each assignment. The overhead associated with this
implementation is one extra integer value per job assignment
message, which is negligible.


Memory Service: Since multiple executions of jobs caused
by eager scheduling may lead to an inconsistent memory
state, the manager implements TIES as follows. Before each
parallel step, the manager creates a twin copy of the shared
pages and unprotects the shared region. The memory
management service then waits until a worker either requests
a page or reports the completion of a job. The manager uses
the twin copy of the shared pages to service worker page
requests. The message that a worker sends to the manager to
report the completion of a job also contains the
modifications that resulted from executing the job.
Specifically, the worker computes the bit-wise XOR of each
modified shared page before and after executing the job, and
sends the results (diffs) to the manager. When the manager
receives such a message, it first checks whether the job has
already been completed by another worker. If so, the diffs
are discarded; otherwise, the diffs are applied (by an XOR
operation) to the manager’s memory space.
Notice  that  the  twin  copies  of  the  shared  pages,  which  are 
used  to  service  worker  page  requests,  are  not  modified.  The 
memory  management  of  a  parallel  step  halts  once  all  the 
jobs  have  run  to  completion,  and  the  program  execution 
then  continues  with  the  next  sequential  step. 
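The diff handling described for the memory service can be sketched as follows. This is a minimal illustration under stated assumptions, not Calypso's code: the names Page, make_diff, Manager, report_completion and run_job are hypothetical, a single shared page stands in for the shared region, and the example job simply writes one byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

using Page = std::vector<std::uint8_t>;

// Worker side: the diff is the bit-wise XOR of a page before and after a job.
Page make_diff(const Page& before, const Page& after) {
    assert(before.size() == after.size());
    Page diff(before.size());
    for (std::size_t i = 0; i < before.size(); ++i)
        diff[i] = static_cast<std::uint8_t>(before[i] ^ after[i]);
    return diff;
}

// Manager side (hypothetical structure, one shared page for brevity).
struct Manager {
    Page twin;                // snapshot served to workers; untouched during the step
    Page memory;              // manager's own copy; diffs are applied here
    std::set<int> completed;  // IDs of jobs whose diffs were already applied

    explicit Manager(const Page& initial) : twin(initial), memory(initial) {}

    // Returns true if the diff was applied, false if the job had already
    // been completed by another worker and the diff was discarded.
    bool report_completion(int job_id, const Page& diff) {
        if (!completed.insert(job_id).second) return false;
        for (std::size_t i = 0; i < memory.size(); ++i)
            memory[i] ^= diff[i];
        return true;
    }
};

// Example job (an assumption for illustration): job k writes 10 + k into byte k.
// Every execution starts from the twin, so repeated executions yield the same diff.
Page run_job(const Page& twin, int job_id) {
    Page local = twin;
    local[static_cast<std::size_t>(job_id)] = static_cast<std::uint8_t>(10 + job_id);
    return make_diff(twin, local);
}
```

Because duplicate completions are discarded and every execution reads from the unmodified twin, executing a job more than once leaves the manager's memory in the same state as executing it exactly once.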

3.1.4  Worker  Process 

A worker repeatedly contacts the manager for jobs to
execute. The manager sends the worker an assignment (a
bunch of jobs) specified by the following parameters: the
address of the function, the number of instances of the job,
and a range of job IDs. After receiving a work assignment, a
worker first access-protects the shared pages, and then calls
the function that represents an assigned job. The worker
handles page faults by fetching the appropriate page from
the manager, installing it in the process’s address space,
and unprotecting the page so that subsequent accesses to the
same page proceed undisturbed. Once the execution of the
function (i.e., the job) completes, the worker identifies all
the modified shared pages, sends the diffs to the manager,
and starts executing the next job in the assignment.
Notice  that  bunching  overlaps  computation  with  commu¬ 
nication  by  allowing  a  worker  to  execute  the  next  job  while 
the  diffs  are  on  the  network  heading  to  the  manager. 
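Eager scheduling combined with bunching can be sketched as a small scheduler that hands out ranges of job IDs and keeps reissuing unfinished jobs. This is a simplified model, not Calypso's implementation: EagerScheduler, next_bunch, complete and simulate_with_crash are hypothetical names, and the wrap-around cursor is just one simple way to reissue jobs that were assigned but never reported.

```cpp
#include <cassert>
#include <utility>
#include <vector>

class EagerScheduler {
public:
    explicit EagerScheduler(int num_jobs)
        : done_(num_jobs, false), remaining_(num_jobs) {}

    bool all_done() const { return remaining_ == 0; }

    // Returns a contiguous range [first, last] of unfinished job IDs.
    // The cursor wraps around, so jobs that were assigned but never
    // reported are handed out again (eager scheduling).
    // Precondition: !all_done() and bunch_size > 0.
    std::pair<int, int> next_bunch(int bunch_size) {
        assert(!all_done() && bunch_size > 0);
        const int n = static_cast<int>(done_.size());
        while (done_[cursor_]) cursor_ = (cursor_ + 1) % n;  // skip finished jobs
        int first = cursor_;
        int last = cursor_;
        while (last + 1 < n && !done_[last + 1] && last - first + 1 < bunch_size)
            ++last;
        cursor_ = (last + 1) % n;
        return {first, last};
    }

    // Returns true only for the first completion report of a job.
    bool complete(int job) {
        if (done_[job]) return false;
        done_[job] = true;
        --remaining_;
        return true;
    }

private:
    std::vector<bool> done_;
    int remaining_;
    int cursor_ = 0;
};

// Worker A takes the first bunch and then crashes (it never reports).
// Worker B keeps requesting bunches until every job is done.
// Returns the total number of assignment messages sent.
int simulate_with_crash(int num_jobs, int bunch_size) {
    EagerScheduler s(num_jobs);
    int messages = 1;
    s.next_bunch(bunch_size);  // assigned to A; the results never arrive
    while (!s.all_done()) {
        std::pair<int, int> r = s.next_bunch(bunch_size);  // assigned to B
        ++messages;
        for (int job = r.first; job <= r.second; ++job) s.complete(job);
    }
    return messages;
}
```

In this model, 8 jobs with a bunch size of 4 need 3 assignment messages even though one worker crashes, while a bunch size of 1 needs 9. The computation completes in both cases because the crashed worker's jobs are reissued to the surviving worker.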

Additional  optimizations  have  been  implemented,  in¬ 
cluding  the  following: 

Caching: For each shared page, the manager keeps track of
the logical step number in which the page was last
modified. This vector is piggybacked on a job assignment
the first time a worker is assigned a job in a new parallel
step. Hence, the associated network overhead is negligible.
Workers use this vector on page faults to locally determine
whether the cached copy of a page is still valid. Thus, pages
that have been paged in by workers are kept valid as long as
possible without the need for an invalidation protocol.
Modified shared pages are re-fetched only when necessary.
Furthermore, read-only shared pages are fetched by a worker
at most once and write-only shared pages are never fetched.
As a result, the programmer does not declare the type of
coherence or caching technique to use; rather, the system
adapts dynamically. Invalidation requests are piggybacked on
work assignment messages and bear very little additional
cost.

Figure 3. Parallel ray tracing with different numbers of
parallel tasks.
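The local validity check used by the caching optimization might look like the sketch below. The text does not give the exact comparison, so the rule used here is an assumption: a cached copy is treated as valid only if it was fetched in a step later than the last step that modified the page. WorkerCache and its members are hypothetical names; steps are numbered from 1, and 0 means "never modified".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical worker-side bookkeeping for cached shared pages.
struct WorkerCache {
    std::vector<int> fetched_step;   // step in which each page was fetched, -1 if never
    std::vector<int> last_modified;  // per-page vector received from the manager

    explicit WorkerCache(std::size_t pages)
        : fetched_step(pages, -1), last_modified(pages, 0) {}

    // Called when the first assignment of a new parallel step arrives
    // with the manager's vector piggybacked on it.
    void install_vector(const std::vector<int>& v) { last_modified = v; }

    // Called on a page fault. A copy fetched during step s reflects all
    // modifications from steps before s, so it is still valid only if the
    // page has not been modified in step s or later (assumed rule).
    bool cached_copy_valid(std::size_t page) const {
        return fetched_step[page] > last_modified[page];
    }

    void record_fetch(std::size_t page, int current_step) {
        fetched_step[page] = current_step;
    }
};
```

With this rule a page that is only read is fetched once and then always found valid, and the decision is made without contacting the manager.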

Prefetching:  Prefetching  refers  to  obtaining  a  portion  of 
the  data  before  it  is  needed,  in  the  hope  that  it  will  be  re¬ 
quired  sometime  in  the  future.  Prefetching  has  been  used  in 
a  variety  of  systems  with  positive  results.  A  Calypso  worker 
implements  prefetching  by  monitoring  its  own  data  access 
patterns  and  page-faults,  and  it  tries  to  predict  future  data 
access  based  on  past  history.  The  predictions  are  then  used 
to  pre-request  shared  pages  from  the  manager.  Depend¬ 
ing  on  the  regularity  of  a  program’s  data  access  patterns, 
prefetching has shown positive results.

3.1.5 Performance Experiments

The experiments were conducted on up to 17 identical
200 MHz PentiumPro machines running the Linux 2.0.34
operating system and connected by 100 Mbps Ethernet through
a non-switched hub. The network was isolated to eliminate
outside effects.

A  publicly  available  sequential  ray  tracing  program  [10] 
was  used  as  the  starting  point  to  implement  parallel  versions 
in Calypso and PVM [16]. The sequential program, which
traced a 512 x 512 image in 53 s, is used for calculating
the parallel efficiencies.
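The text does not spell out how the parallel efficiencies are computed. Assuming the conventional definition, with the 53 s sequential run as the baseline and $T_p$ the measured run time on $p$ machines, the calculation would be:

```latex
E(p) = \frac{T_{\mathrm{seq}}}{p \, T_p}, \qquad T_{\mathrm{seq}} = 53\ \mathrm{s}
```

Under this assumed definition, a purely hypothetical run time of 4 s on 16 machines would give E(16) = 53 / (16 x 4), or about 0.83. The 4 s figure is illustrative only and is not a measured result.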

The PVM implementation used an explicit master/slave
programming style for load balancing, whereas for Calypso,
load balancing was provided transparently by the run-time
system. To demonstrate the effects of adaptivity, the PVM
and Calypso programs were parallelized using different
numbers of tasks and executed on 1 to 16 machines. The
performance results are illustrated in Figure 3. As the
results indicate, the PVM program is very sensitive to the
number and the computation requirements of the parallel
tasks, and a hand-tuned PVM program outperforms the Calypso
program by at most 4%. Notice that, independent of the
number of machines used, the interplay of bunching, eager
scheduling, and TIES allows the Calypso program to achieve
its peak performance using 512 fine-grain tasks: as a
result of bunching, fine-grain tasks in effect execute in
coarse-grain fashion, and the combination of eager
scheduling and TIES compensates for any over-bunching that
may occur.

3.2 Chime

Chime  is  a  parallel  processing  system  that  retains  the 
salient  features  of  Calypso,  but  supports  a  far  richer  set  of 
programming  features.  The  internals  of  Chime  are  signifi¬ 
cantly  different  from  Calypso,  and  it  runs  on  the  Windows 
NT  operating  system  [27].  Chime  is  the  first  system  that 
provides a true general shared-memory multiprocessor
environment on a network of machines. It achieves this by
implementing the shared-memory part of the CC++ [8]
language on a distributed system. Thus, in addition to the
Calypso features of fault tolerance and load balancing,
Chime provides:

•  True  multi-processor  shared-memory  semantics  on  a 
network  of  machines. 

•  Block  structured  scoping  of  variables  and  non-isolated 
distributed  parallel  execution. 

•  Support  for  nested  parallelism. 

•  Inter-task  synchronization. 

3.2.1  Chime  Architecture 

A program written in CC++ is preprocessed to convert it to
C++, then compiled and linked with the Chime library. The
executable is then run using the manager-worker scheme of
Calypso.

The  manager  process  consists  of  two  threads,  the  appli¬ 
cation  thread  and  the  control  thread.  The  application  thread 
executes  the  code  programmed  by  the  programmer.  The 
control  thread  executes,  exclusively,  the  code  provided  by 
the  Chime  library.  Hence,  the  application  thread  runs  the 
program  and  the  control  thread  runs  the  management  rou¬ 
tines,  such  as  scheduling,  memory  service,  stack  manage¬ 
ment,  and  synchronization  handling. 

The  worker  process  also  consists  of  two  threads,  the  ap¬ 
plication  thread  and  the  control  thread.  The  application 
threads  in  the  worker  and  manager  are  identical.  However, 
the  control  thread  in  the  worker  is  the  client  of  the  control 
thread  in  the  manager.  It  requests  work  from  the  manager, 
retrieves  data  pages  from  the  manager  and  flushes  updated 
memory  to  the  manager  at  the  end  of  the  task  execution. 

3.2.2  Chime  and  CC++ 

As mentioned earlier, Chime provides a programming
interface that is based on the Compositional C++ or
CC++ [8] language definition. CC++ provides language
constructs for shared memory, nested parallelism, and
synchronization. All threads of the parallel computation share
all  global  variables.  Variables  declared  local  to  a  function 
are  private  to  the  thread  running  the  function,  but  if  this 
thread  creates  more  threads  inside  the  function,  then  all  the 
children  share  the  local  variables. 

CC++ uses the par and parfor statements to express
parallelism. Par and parfor statements can be nested. CC++
uses single-assignment variables for synchronization. A
single-assignment variable is assigned a value by some
thread, called the writing thread. Any other thread, called
a reading thread, can read the written value. The
constraint is that the writing thread must assign the
variable before a reading thread reads it; otherwise the
reading thread is blocked until the writing thread assigns
the variable.
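The blocking behavior of a single-assignment variable can be illustrated in standard C++ with a mutex and a condition variable. This is only an in-process analog written for illustration; it is neither CC++ syntax nor Chime's distributed implementation, and the names SingleAssignment, assign, read and reader_waits_for_writer are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <stdexcept>
#include <thread>

template <typename T>
class SingleAssignment {
public:
    // Assigns the value once; a second assignment is reported as an error.
    void assign(const T& v) {
        {
            std::lock_guard<std::mutex> lock(m_);
            if (assigned_) throw std::logic_error("already assigned");
            value_ = v;
            assigned_ = true;
        }
        cv_.notify_all();  // wake up every blocked reader
    }

    // Blocks until the writing thread has assigned the value, then returns it.
    T read() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return assigned_; });
        return value_;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool assigned_ = false;
    T value_{};
};

// Usage: the reader may start before the writer; it simply blocks in read()
// until the assignment happens.
int reader_waits_for_writer() {
    SingleAssignment<int> x;
    int seen = 0;
    std::thread reader([&] { seen = x.read(); });
    std::thread writer([&] { x.assign(42); });
    writer.join();
    reader.join();
    return seen;
}
```

Whichever thread is scheduled first, the reader in reader_waits_for_writer() returns only after the assignment, so the function yields the assigned value.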

These language constructs pose significant challenges to a
distributed (DSM-based) implementation that is also fault
tolerant. Our implementation uses a preprocessor to detect
the shared variables and parallel constructs, stack-sharing
support (called distributed cactus stacks) to implement
parent-child variable sharing, and innovative scheduling
techniques, coupled with appropriate memory flushing, to
provide synchronization [28].

3.2.3  Preprocessing  CC++ 

Consider  the  following  parallel  statement: 

    parfor (int i = 0; i < 100; i++) {
        a[i] = 0;
    };

Figure 4. A DAG for a nested parallel step.


This  creates  100  tasks,  each  task  assigning  one  element 
of  the  array  a.  The  preprocessor  converts  the  above  state¬ 
ment  to  something  along  the  following  lines: 

    1.  for (int i = 0; i < 100; i++) {
    2.      add task entry and &i
            in the scheduling table;
        }
    3.  SaveContext of this thread;
    4.  if worker {
            a[i] = 0;
    5.      terminate task;
        }
    6.  else {
    7.      suspend this thread and
            request manager to
            schedule threads till
            all tasks completed;
        }

The  above  code  may  execute  in  the  manager  (top  level 
parallelism)  or  the  worker  (nested  parallelism).  Assume  the 
above  code  executes  in  the  manager.  Then  the  application 
thread  of  the  manager  executes  the  code.  Lines  1  and  2  cre¬ 
ate  100  entries  in  the  scheduling  table,  one  per  parallel  task. 
Then  line  3  saves  the  context  of  the  parent  task,  including 
the  parent  stack.  Then  the  parent  moves  to  line  7  and  this 
causes  the  application  thread  to  transfer  control  to  the  con¬ 
trol  thread. 

The  control  thread  now  waits  for  task  assignment  re¬ 
quests  from  the  control  threads  of  workers.  When  a  worker 
requests  a  task,  the  manager  control  thread  sends  the  stored 
context  and  the  index  value  of  i  for  a  particular  task  to  the 
worker. 

The control thread in the worker installs the received
context and the stack on the application thread in the worker
and resumes the application thread. This thread now starts
executing at line 4. Because it is running in a worker, it
performs one iteration of the loop and terminates. Upon
termination, the worker control thread regains control,
flushes the updated memory to the manager, and asks the
manager for a new assignment.

Figure 5. Graphical representation of a cactus stack.

3.2.4  Scheduling 

The control thread at the manager is also responsible for
task assignment, or scheduling. The manager uses a
scheduling algorithm that takes care of task allocation to
the workers as well as scheduling of nested parallel tasks in
the correct order. Nested parallel tasks in an application
form a DAG, as shown in Figure 4.

Each nested parallel step consists of several sibling
parallel tasks. It also has a parent task and a continuation
that must be executed once the nested parallel step has been
completed. A continuation is an object that fully describes
a future computation. To complicate the scenario, a
continuation may itself contain nested parallel steps.

The manager maintains an execution dependency graph to
capture the dependencies between the parallel tasks and
schedules them and their corresponding continuations in the
correct order. Eager scheduling is used to allocate tasks to
the workers.
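The ordering constraint between sibling tasks and continuations can be sketched with a counter per parallel step: the continuation becomes ready only when the last sibling completes, and it then takes the place of the suspended parent in the enclosing step. The data structures below (Step, Task, DagScheduler) are hypothetical and much simpler than a real dependency graph. Eager re-execution is left out, and a FIFO ready queue stands in for worker requests.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

struct Step { int pending; int enclosing; };   // enclosing == -1 means top level
struct Task { std::string name; int step; };   // step whose counter the task affects

class DagScheduler {
public:
    // Creates a parallel step with `children` ready tasks. Its continuation
    // will count toward `enclosing`, replacing the suspended spawning task.
    int spawn(const std::string& label, int children, int enclosing) {
        const int id = static_cast<int>(steps_.size());
        steps_.push_back(Step{children, enclosing});
        labels_.push_back(label);
        for (int i = 0; i < children; ++i)
            ready_.push_back(Task{label + "." + std::to_string(i), id});
        return id;
    }

    bool has_ready() const { return !ready_.empty(); }

    Task take() {
        Task t = ready_.front();
        ready_.pop_front();
        return t;
    }

    // Called when a task or a continuation runs to completion.
    void complete(const Task& t) {
        if (t.step < 0) return;  // top-level continuation: nothing waits on it
        Step& s = steps_[t.step];
        if (--s.pending == 0)    // last sibling finished: release the continuation
            ready_.push_back(Task{"cont(" + labels_[t.step] + ")", s.enclosing});
    }

private:
    std::vector<Step> steps_;
    std::vector<std::string> labels_;
    std::deque<Task> ready_;
};

// Example: an outer step with two tasks, the first of which spawns a nested
// step of two tasks and suspends. Returns the order in which work was run.
std::vector<std::string> run_example() {
    DagScheduler d;
    std::vector<std::string> order;
    d.spawn("outer", 2, -1);
    while (d.has_ready()) {
        Task t = d.take();
        order.push_back(t.name);
        if (t.name == "outer.0")
            d.spawn("inner", 2, t.step);  // parent suspends; it does not complete
        else
            d.complete(t);
    }
    return order;
}
```

In run_example() the continuation of the inner step runs only after both inner tasks finish, and the continuation of the outer step runs only after the inner continuation and the second outer task have finished.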

3.2.5  Cactus  Stacks 

The  cactus  stacks  are  used  to  handle  sharing  of  local  vari¬ 
ables  (see  Figure  5).  For  top  level  nesting,  the  manager  pro¬ 
cess  is  suspended  at  a  point  in  execution  where  its  stack 
and  context  should  be  inherited  by  all  the  children  threads. 
When  a  worker  starts,  it  is  sent  the  contents  of  the  man¬ 
ager’s  stack  along  with  the  context.  The  controlling  thread 
of  the  worker  process  then  installs  this  context  as  well  as  the 
stack, and starts the application thread.

However, if a worker executes a nested parallel step, the
same code as in the above case is used, but the runtime
system behaves slightly differently. The worker, after
generating the nested parallel jobs, invokes a routine that
remotely adds the jobs and the continuation of the parent
job to the manager’s job table. The worker then suspends,
and the control thread in the worker sends the worker’s
complete context, including the newly grown stack, to the
manager.

The stack for a nested parallel task, therefore, is
constructed by writing the stack segments of its ancestors
onto the stack of a worker’s application thread. Upon
completion, the local portion of the stack for a nested
parallel task is unwound, leaving only those portions that
represent its ancestors. This portion of the stack is then
XOR’ed with its unmodified shadow and the result is returned
to the manager.
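The sharing structure of a cactus stack can be illustrated with stack segments that point to their parent segment. The sketch below is a single-process analog only: Chime copies stack contents between machines and returns XOR diffs, whereas here sibling tasks simply share the parent's segment through a pointer. Frame, push and lookup are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// One stack segment: a task's local variables plus a link to its parent's segment.
struct Frame {
    std::shared_ptr<Frame> parent;
    std::map<std::string, int> locals;
};

// Creates a new segment on top of `parent` (pass nullptr for the root).
std::shared_ptr<Frame> push(const std::shared_ptr<Frame>& parent) {
    auto f = std::make_shared<Frame>();
    f->parent = parent;
    return f;
}

// Walks from a task's own segment toward the root and returns the innermost
// binding of `name`, or nullptr if no ancestor declares it.
int* lookup(const std::shared_ptr<Frame>& top, const std::string& name) {
    for (Frame* f = top.get(); f != nullptr; f = f->parent.get()) {
        auto it = f->locals.find(name);
        if (it != f->locals.end()) return &it->second;
    }
    return nullptr;
}
```

Two children created with push(root) each have a private segment for their own locals, while both resolve the parent's locals to the same storage. This is the parent-child sharing that the distributed cactus stack has to reproduce across machines.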


3.2.6  Performance  Experiments 


Many performance tests have been done on Chime [27],
evaluating its capabilities in speedup, load balancing, and
fault tolerance. The results are competitive with other
systems, including Calypso. We present three micro-tests
that show the performance of nested parallelism (including
cactus stacks), the Chime synchronization mechanisms, and
the preemptive scheduling mechanisms.

Figure 6. Performance of Nested Parallelism.

For  the  nested  parallelism  overhead,  we  run  a  program 
that  recursively  creates  two  child  threads  until  1024  leaf 
threads have been created. Each leaf thread assigns one
integer in a shared array and then terminates. Figure 6
shows that the total runtime of the program asymptotically
saturates as the number of machines is increased, due to
the bottleneck in stack and thread management at the
manager. The
time  taken  to  handle  all  overhead  for  a  thread  (including 
cactus  stacks)  is  74  ms. 

To measure the synchronization overhead, we use 512
single-assignment variables, assign them from 512 threads,
and read them from 512 other threads. As can be seen in
Figure 7, the synchronization overhead is about 86 ms per
occurrence, showing that synchronization does not add too
much overhead over basic thread creation.

Figure 7. Performance of Synchronization.

Figure 8. Performance of Preemptive Scheduling.


To measure the impact of preemptive scheduling algorithms
for programs with different grain sizes, we decomposed a
matrix-multiply algorithm on two 1500 x 1500 matrices into
5 tasks (very coarse grain), 10 tasks (coarse grain), 21
tasks (medium grain), and 1500 tasks (fine grain). All
experiments used three identical machines. Given the equal
task lengths, our experiments were biased against
preemptive schedulers. As shown in Figure 8, preemptive
scheduling overall has definite advantages over
non-preemptive scheduling, notwithstanding its additional
overheads. Specifically, for coarse-grained and very
coarse-grained tasks, round-robin scheduling effectively
complements eager scheduling in reducing overall execution
time. For most other task sizes, the preemptive task
bunching algorithm yields the best performance; for
fine-grained tasks it minimizes the number of preemptions
that are necessary.

3.3  Charlotte 

Many of the assumptions made for (local-area) networks of
machines are not valid for the Web. For example, the
machines on the Web do not have a common shared file system,
no single individual has access rights (a user account) on
every machine, and the machines are not homogeneous.
Another important distinction is the concept of users. A
user who wants to execute a program on a network of
machines typically performs the following steps: she logs
onto a machine under her control (i.e., the local machine),
from the local machine logs onto other machines on the
network (i.e., remote machines) and initializes the
execution environment, and then starts the program. In the
case of the Web, no user can possibly
hope  to  have  the  ability  to  log  onto  remote  machines.  Thus, 
another  set  of  users  who  control  remote  machines,  or  soft¬ 
ware  agents  acting  on  their  behalf,  must  voluntarily  allow 
others  access.  To  distinguish  the  two  types  of  users,  this 
section  uses  the  term  end-users  to  refer  to  individuals  who 
start  the  execution  (on  their  local  machines)  and  await  re¬ 
sults,  and  volunteers  to  refer  to  individuals  who  voluntarily 
run  parts  of  end-users’  programs  on  their  machines  (remote 
to  end-users).  Similarly,  volunteer  machines  is  used  to  refer 
to  machines  owned  by  volunteers. 

Simplicity and security are important objectives for
volunteers. Unless the process of volunteering a machine is
simple (for example, as simple as a single mouse click) and
the process of withdrawing a machine is also simple, it is
likely that many would-be volunteer machines will be left
idle. Furthermore, volunteers need assurance that the
integrity of their machine and file system will not be
compromised by allowing “strangers” to execute computations
on their machines. Without such an assurance, it is natural
to assume that security concerns will outweigh the
charitable willingness to volunteer.

Charlotte  [6]  is  the  first  parallel  programming  system  to 
provide one-click computing. The idea behind one-click
computing is to allow volunteers from anywhere on the
Web,  and  without  any  administrative  effort,  to  participate  in 
ongoing  computations  by  simply  directing  a  standard  Java- 
capable  browser  to  a  Web  site.  A  key  ingredient  in  one-click 
computing  is  its  lack  of  requirements:  user-accounts  are  not 
required,  the  availability  of  the  program  on  a  volunteer’s 
machine  is  not  assumed,  and  system-administration  is  not 
required.  Charlotte  builds  on  the  capability  of  the  growing 
number  of  Web  browsers  to  seamlessly  load  Java  applets 
from  remote  sites,  and  the  applet  security  model,  which  en¬ 
ables  Web  browsers  to  execute  untrusted  applets  in  a  con¬ 
trolled  environment,  to  provide  a  comprehensive  program¬ 
ming  system. 


3.3.1  Charlotte  Programs 

A Charlotte program is written by inserting any number of
parallel steps into a sequential Java program. A parallel
step is composed of one or more routines, which are
(sequential) threads of control capable of executing on
remote machines.

A parallel step starts and ends with the invocation of the
parBegin() and parEnd() methods, respectively. A routine is
written by subclassing the Droutine class and overriding
its drun() method. Routines are specified by invoking the
addRoutine() method with two arguments: a routine object
and an integer, n, representing the number of routine
instances to execute. To execute a routine, the Charlotte
runtime system invokes the drun() method of routine objects,
and passes as arguments the number of routine instances
created (i.e., n) and an identifier in the range
(0, ..., n] representing the current instance.

A  program’s  data  is  logically  partitioned  into  private  and 
shared  segments.  Private  data  is  local  to  a  routine  and  is 
not  visible  to  other  routines;  shared  data,  which  consists 
of  shared  class-type  objects,  is  distributed  and  is  visible 
to  all  routines.  For  every  basic  data-type  defined  in  Java, 
Charlotte  implements  a  corresponding  distributed  shared 
class-type. For example, Java provides int and float
data-types, whereas Charlotte provides Dint and Dfloat
classes. The class-types are implemented as standard Java
classes, and are read and written by invoking get() and
set() method calls, respectively. The runtime system
maintains  the  coherence  of  shared  data. 

3.3.2  Implementation 

Worker  Process:  A  Charlotte  worker  process  is  imple¬ 
mented  by  the  Cdaemon  class  which  can  run  either  as  a 
Java  application  or  as  a  Java  applet.  At  instantiation,  a 
Cdaemon  object  establishes  a  TCP/IP  connection  to  the 
manager  and  maintains  this  connection  throughout  the  com¬ 
putation. 

Two implementation features are worth noting. First,
since  Cdaemon  is  implemented  as  an  applet  (as  well  as  an 
application),  the  code  does  not  need  to  be  present  on  volun¬ 
teer  machines  before  the  computation  starts.  By  simply  em¬ 
bedding  the  Cdaemon  applet  in  an  HTML  page,  browsers 
can  download  and  execute  the  worker  code.  Second,  the 
Cdaemon  class,  unlike  its  counterpart  the  Calypso  worker, 
is  independent  of  the  Charlotte  program  it  executes.  Thus, 
not  only  are  Charlotte  workers  able  to  execute  parallel  rou¬ 
tines  of  any  Charlotte  program,  but  only  the  necessary  code 
segments are transferred to volunteer machines.

Manager Process: A manager process begins with the main()
method of a program and executes the non-parallel steps in
a sequential fashion. It also manages the progress of
parallel steps by providing scheduling and memory services
to workers. These services are based on eager scheduling,
bunching, and TIES.

Figure 9. Performance comparison of Charlotte, RMI and
JPVM programs.

Figure 10. Load balancing of Charlotte, RMI and JPVM
programs.

Distributed  Shared  Class  Types:  Charlotte’s  distributed 
shared  memory  is  implemented  in  pure  Java  at  the  data-type 
level;  that  is,  through  Java  classes  as  stated  above.  For  each 
primitive Java type, such as int and float, there is a
corresponding Charlotte class-type, such as Dint and Dfloat. The
member  variables  of  these  classes  are  a  value  field  of  the 
corresponding  primitive  type,  and  a  state  flag  that  can 
be  not-valid,  readable,  or  dirty.  It  is  important  to 
note  that  different  parts  of  the  shared  data  can  be  updated 
by  different  worker  processes  without  false  sharing,  as  long 
as  the  CRCW-Common  condition  is  met.  (That  is,  several 
workers  in  a  step  can  update  the  same  data  element,  as  long 
as  all  of  them  write  the  same  value.)  The  shared  memory 
is  always  logically  coherent,  independently  of  the  order  in 
which  routines  are  executed. 
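
The value-plus-state-flag layout described above can be illustrated with a short sketch. It is written in Python rather than Java for brevity and is not Charlotte's code: the names DInt, State and merge_writes, the fetch callback, and the fetch-on-first-read behavior are all assumptions made for illustration. Only the two member variables and the CRCW-Common rule come from the text.

```python
from enum import Enum


class State(Enum):
    NOT_VALID = "not-valid"   # local copy must be obtained before it is read
    READABLE = "readable"     # local copy may be read
    DIRTY = "dirty"           # local copy was written during the current step


class DInt:
    """Illustrative shared integer: a value field plus a state flag."""

    def __init__(self, fetch):
        self._fetch = fetch           # hypothetical callback that obtains the value
        self._value = 0
        self.state = State.NOT_VALID

    def get(self):
        # Assumption: a read of a not-valid element triggers a fetch.
        if self.state is State.NOT_VALID:
            self._value = self._fetch()
            self.state = State.READABLE
        return self._value

    def set(self, value):
        self._value = value
        self.state = State.DIRTY


def merge_writes(writes):
    """Apply the CRCW-Common rule to the values written to one element."""
    distinct = set(writes)
    if not distinct:
        raise ValueError("no writes to merge")
    if len(distinct) > 1:
        raise ValueError("CRCW-Common violated: %r" % sorted(distinct))
    return distinct.pop()
```

Because the state is kept per element, two workers that write different elements never interfere, which is the sense in which the text says false sharing is avoided.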

3.3.3  Performance  Experiments 

The  experiments  were  conducted  in  the  same  execu¬ 
tion  environment  as  in  Section  3.1.5.  Programs  were  com¬ 
piled  (with  compiler  optimization  turned  on)  and  executed 
in  the  Java  Virtual  Machine  (JVM)  packaged  with  Linux 
JDK  1.1.5  v7.  TYA  version  0.07  [22]  provided  just-in-time 
compilation. 

A  publicly  available  sequential  ray  tracing  program  [24] 
was  used  as  the  starting  point  to  implement  parallel  ver¬ 
sions  in  Charlotte,  Java  RMI  [14],  and  JPVM  [15].  Java 
RMI  is  an  integral  part  of  Java  1.1  standard  and,  therefore, 
it  is  a  natural  choice  for  comparison.  JPVM  is  a  Java  im¬ 
plementation  of  PVM,  one  of  the  most  widely  used  parallel 
programming  systems.  For  the  experiments,  a  500  x  500 
image  was  traced.  The  sequential  program  took  154  s  to 
complete,  and  this  number  is  used  as  the  base  in  calculating 
the  speedups. 


The first series of experiments compares the performance
of the three parallel implementations of the ray tracer
(see Figure 9). In the case of Charlotte, the same program
with the same runtime arguments was used for every run;
the program tuned itself to the execution environment. For
the RMI and JPVM programs, on the other hand, executions
with different grain sizes were timed and the best results are
reported; that is, these programs were hand-tuned for the
execution environment. The results indicate that when using 16
volunteers, the Charlotte implementation runs within 5% and
10% of the hand-tuned JPVM and RMI implementations,
respectively. It is encouraging to see that the performance of
Charlotte is competitive with other systems that do not
provide load balancing and fault masking.

The final set of experiments illustrates the efficiency
of the programs when executing on machines of varying
speeds, a common scenario when executing programs on
the Web. Exactly the same programs with the same granularity
sizes as in the previous experiment were run on n groups of
volunteers, 1 ≤ n ≤ 4, where each group consisted of four
machines: one normal machine, one machine
slowed down by 25%, one machine slowed down by 50%,
and one machine slowed down by 75%. Each group therefore has
a computing potential of 1 + 0.75 + 0.5 + 0.25 = 2.5 volunteer
machines. The results are depicted in Figure 10. As the results
indicate, the Charlotte program is the only one able to maintain
its efficiency: it degraded by approximately 5%. In contrast,
the efficiency of the RMI and JPVM programs dropped by as
much as 60% and 45%, respectively.
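
The arithmetic behind the 2.5-machine figure can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and defining efficiency as speedup divided by computing potential is an assumption about how the measurements were normalized, since the text does not spell it out.

```python
def computing_potential(slowdowns):
    """Sum of relative speeds; a machine slowed down by s contributes 1 - s."""
    return sum(1.0 - s for s in slowdowns)


def efficiency(sequential_time, parallel_time, potential):
    """Assumed definition: speedup divided by the computing potential."""
    speedup = sequential_time / parallel_time
    return speedup / potential


# One group from the experiment: slowdowns of 0%, 25%, 50% and 75%.
GROUP = [0.0, 0.25, 0.50, 0.75]

for n in range(1, 5):
    print(n, "group(s):", n * computing_potential(GROUP), "machine equivalents")
```

Under this definition, a program that divides work evenly among unequal machines is limited by the slowest machine, whereas a program that balances load dynamically can approach the full potential of the group.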

3.4  ResourceBroker 

ResourceBroker [4] is a resource management system
for monitoring computing resources in distributed multiuser
environments and for dynamically assigning them to
concurrently executing computations. Although applicable
to a wide variety of computations, including sequential
ones, it especially benefits adaptive parallel computations.

Figure 11. The components of ResourceBroker that comprise the resource management and the
agent layers.

Adaptive  parallel  computations  can  effectively  use  net¬ 
worked  machines  because  they  dynamically  expand  as  ma¬ 
chines  become  available  and  dynamically  acquire  machines 
as  needed.  While  most  parallel  programming  systems  pro¬ 
vide  the  means  to  develop  adaptive  programs,  they  do  not 
provide  any  functional  interface  to  external  resource  man¬ 
agement  systems.  Thus,  no  existing  resource  management 
system  has  the  capability  to  manage  resources  on  commod¬ 
ity  system  software,  arbitrating  the  demands  of  multiple 
adaptive  computations  written  using  diverse  programming 
environments.  Indeed,  existing  resource  management  sys¬ 
tems  are  tightly  integrated  with  the  programming  system 
they  support  and  their  inability  to  support  more  than  one 
programming  system  severely  limits  their  applicability. 

ResourceBroker is built to validate a set of novel
mechanisms that facilitate dynamic allocation of resources to
adaptive parallel computations. The mechanisms utilize
low-level features common to many programming systems
and are unique in their ability to transparently manage adaptive
parallel programs that were not developed to have their
resources managed by external systems. The ResourceBroker
prototype is the first system that can support adaptive
programs written in more than one programming system, and
it has been tested using a mix of programs written in PVM,
MPI, Calypso, and PLinda.

4  Computing  Communities:  Metacomputing 
for  General  Computations 

So far, we have addressed metacomputing for parallel
computations. Operating systems such as Amoeba [29],
Plan-9 [26], Clouds [12] and, to an extent, Mach [1] targeted
the use of distributed systems for seamless general
purpose computing. However, the rise of commodity operating
systems and the need for application binary compatibility
have made such approaches less attractive, necessitating
instead that general computations also be supported
on metacomputing environments. To enable the latter, we
have designed and will implement the Computing Community
(CC) framework.

A  Computing  Community  (CC)  is  a  collection  of  ma¬ 
chines  (with  dynamic  membership)  that  form  a  single,  dy¬ 
namically  changing,  virtual  multiprocessor  system.  It  has 
global  resource  management,  dynamic  (automatic)  recon¬ 
figurability,  and  the  ability  to  run  binaries  of  all  applica¬ 
tions  designed  for  a  base  operating  system.  The  physical 
network  disappears  from  the  view  of  the  computations  that 
run  on  the  CC. 

The CC brings the flexibility of well-designed distributed
computing environments to the world of non-distributed
applications (including legacy applications) without the need
for distributed programming, new APIs, RPCs, object
brokerage, or similar mechanisms.

4.1  Realizing a CC

We  are  in  the  process  of  building  a  CC  on  top  of  the  Win¬ 
dows  NT  operating  system,  with  the  initial  software  archi¬ 
tecture  shown  in  Figure  12.  The  CC  comprises  three  syner¬ 
gistic  components:  (1)  Virtual  Operating  System  (2)  Global 
Resource  Manager  (3)  Application  Adaptation  System. 

The  Virtual  Operating  System  (VOS)  is  a  layer  of  soft¬ 
ware  that  non-intrusively  operates  between  the  applications 
and  the  standard  operating  system.  The  VOS  presents  the 
standard  Windows  NT  API  to  the  application,  but  can  ex¬ 
ecute  the  same  API  calls  differently,  thereby  extending  the 
OS’s  power.  The  VOS  essentially  decouples  the  virtual  en¬ 
tities  required  for  executing  a  computation  from  their  map¬ 
pings  to  physical  resources  in  the  CC. 

The Global Resource Manager manages all CC resources,
dynamically discovering the availability of new resources,
integrating them into the CC, and making them
available for use by CC computations. It handles resource
requests from other components of the system and satisfies
them as per scheduling requirements.

Figure 12. Software architecture for Computing Communities.

The  Application  Adaptation  System  enables  the  compu¬ 
tations  to  take  full  advantage  of  CC  resources  and  pro¬ 
vides  dynamic  reconfiguration  capabilities.  Adaptation 
techniques  allow  computations  to  become  aware  of  and 
gracefully  adapt  themselves  to  changes  in  CC  resource  char¬ 
acteristics. 

Figure  13  shows  a  conceptual  view  of  a  CC.  It  takes  a 
set  of  operating  systems,  and  a  set  of  resources,  and  via  a 
layer  of  middleware  converts  it  into  an  integrated  commu¬ 
nity.  CCs  can  expand  and  contract  dynamically,  and  the 
computations  are  completely  mobile  within  CCs.  In  short, 
using  the  CC  framework,  the  computation  transparently  ac¬ 
quires  the  benefit  of  operating  in  a  distributed  environment. 

4.2  The  Virtualization  Concept 

Under  a  standard  OS,  a  process  runs  in  a  logical  address 
space,  is  bound  to  a  machine,  and  interacts  with  the  OS  lo¬ 
cal  to  this  machine.  In  fact,  the  processes  (and  their  threads) 
are  virtualizations  of  the  real  CPUs.  However,  such  virtual¬ 
ization  is  low-level  and  limited  in  scope. 

In the CC, virtualization is defined at a much higher
level, and all physical resources (CPU, memory, disks, and
networks) as well as the OSs on all the machines are
aggregated into a single, unified (distributed) virtual resource
space.

A  process  in  the  CC  is  enveloped  in  a  virtual  shell  (Fig¬ 
ure  14),  which  makes  the  process  feel  that  it  is  running  on 
a  standard  OS.  However,  the  shell  creates  a  virtual  world 
made  of  the  aggregate  of  the  physical  worlds  in  the  CC. 


Consider a user U who starts an application A (and its
GUI) on some machine M1. Soon, U abruptly moves to
another machine M2. Now U can instruct the CC to connect
the virtual screen, virtual keyboard, and the virtual mouse of
A to the physical resources of M2. The CC complies and U
continues working on M2, as if A executed there. Later the
CC might decide it preferable to run the application on M2.
The scheduler then transparently moves A to M2, preserving
process state, open files, and network connections.

The  above  simple  scenario  shows  a  particular  aspect  of 
the  power  of  virtualization.  In  general: 

•  The users can move their virtual “home machines” at
will, even for applications that are currently executing.
This is the ultimate mobile computing scenario.

•  A critical service running on machine M1 can be
moved to machine M2 if M1 has to be relinquished.

•  Schedulers  can  control  the  complete  set  of  resources. 

•  The  provision  of  multiple  physical  resources  for  a  sin¬ 
gle  virtual  resource  delivers  important  new  capabili¬ 
ties  ranging  from  duplicating  application  displays  on 
multiple  screens  to  replicating  processes  for  fault  tol¬ 
erance. 

The CC functionality relies upon three key mechanisms:
API interception, proxies, and translations between physical
and logical handles. API interception allows the API
calls from an application to the operating system to be
intercepted and the behavior of the API call to be modified.
After intercepting a call, the virtual operating system (VOS)
does one of the following: (1) passes the call on
to the local Windows NT operating system; (2) passes the
call to a remote Windows NT operating system; (3) executes
the call inside the VOS; or (4) executes some VOS code
and then passes the call to a local or remote Windows NT
system.

Figure 13. A conceptual view of Computing Communities.
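
The four outcomes of an intercepted call can be pictured as a small dispatcher. The following Python sketch is illustrative only and does not describe how the VOS is built: the class VirtualOS, its policy table, the action names, and the callables standing in for the local and remote operating systems are all hypothetical.

```python
class VirtualOS:
    """Toy dispatcher for intercepted API calls (all names are hypothetical)."""

    def __init__(self, local_os, remote_os):
        self.local_os = local_os      # callable(name, *args) standing in for the local OS
        self.remote_os = remote_os    # callable(name, *args) standing in for a remote OS
        self.policy = {}              # call name -> action; unlisted calls go to the local OS
        self.trace = []               # bookkeeping done by VOS-side code in case (4)

    def execute_internally(self, name, *args):
        # Case (3): the VOS answers the call itself.
        return ("vos", name, args)

    def intercept(self, name, *args):
        action = self.policy.get(name, "local")
        if action == "local":                         # case (1)
            return self.local_os(name, *args)
        if action == "remote":                        # case (2)
            return self.remote_os(name, *args)
        if action == "internal":                      # case (3)
            return self.execute_internally(name, *args)
        if action in ("wrap-local", "wrap-remote"):   # case (4)
            self.trace.append((name, args))           # VOS code runs first
            target = self.local_os if action == "wrap-local" else self.remote_os
            return target(name, *args)
        raise ValueError("unknown action: %s" % action)
```

A per-call policy table is only one possible design; the choice among the four outcomes could equally depend on the arguments of the call or on where the process currently runs.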

In order to reallocate processes to machines, a general
form of process migration is necessary. To move a process
from one location to another, just moving the state is not
enough; all connections and handles have to be moved as well.
This can be achieved by having proxies that emulate the
connections of the process after the process has moved. For
example, if a process P moving from M1 to M2 has an open
networking connection to M3, a proxy is created on M1,
which keeps the original connection to M3 open and then
forwards messages between P and M3 after P has moved.

Equally  essential  to  successful  virtualization  of  re¬ 
sources for migrating processes is the use of virtual handles.
For example, when a process opens a file on top of a
VOS,  the  VOS  intercepts  this  call  and  stores  the  returned 
physical  handle  but  returns  to  the  process  a  handle,  which 
we  refer  to  as  virtual.  The  virtual  handle  can  be  used  by 
the  process,  regardless  of  migrations,  to  access  that  file,  due 
to  the  transparent  translation  service  provided  by  the  VOS. 
The  virtual  handles  are  used  to  virtualize  I/O  connections, 
sub-processes,  threads,  files,  network  sockets,  etc. 
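
The translation between virtual and physical handles can be sketched as a lookup table that is updated when a process migrates. The Python below is a minimal illustration under stated assumptions, not the VOS implementation: HandleTable and its methods are hypothetical names, and integer handles are used only for simplicity.

```python
import itertools


class HandleTable:
    """Maps stable virtual handles to the current physical handles."""

    def __init__(self):
        self._next_virtual = itertools.count(1)
        self._physical = {}

    def register(self, physical):
        # Called after the real open succeeds; the process only sees `virtual`.
        virtual = next(self._next_virtual)
        self._physical[virtual] = physical
        return virtual

    def translate(self, virtual):
        # Called on every intercepted use of the handle.
        return self._physical[virtual]

    def rebind(self, virtual, new_physical):
        # Called after a migration, once the resource is reopened or proxied.
        if virtual not in self._physical:
            raise KeyError(virtual)
        self._physical[virtual] = new_physical


table = HandleTable()
vh = table.register(0x1A4)   # physical handle on the original machine
table.rebind(vh, 0x2B0)      # physical handle on the destination machine
```

The process keeps using the same virtual handle before and after the rebind, which is what makes the migration transparent to it.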
Abstract

The Management System for Heterogeneous Networks
(MSHN) is a resource management system for use in
heterogeneous environments. This paper describes the
goals of MSHN, its architecture, and both completed and
ongoing research experiments. MSHN’s main goal is to
determine the best way to support the execution of many
different applications, each with its own quality of service
(QoS) requirements, in a distributed, heterogeneous
environment. MSHN’s architecture consists of seven
distributed, potentially replicated components that
communicate with one another using CORBA (Common
Object Request Broker Architecture). MSHN’s
experimental investigations include: (1) the accurate,
transparent determination of the end-to-end status of
resources; (2) the identification of optimization criteria
and how non-determinism and the granularity of models
affect the performance of various scheduling heuristics
that optimize those criteria; (3) the determination of how
security should be incorporated between components as
well as how to account for security as a QoS attribute; and
(4) the identification of problems inherent in application
and system characterization.


1.  Introduction 

The Management System for Heterogeneous Networks
(MSHN¹) project seeks to determine an effective design
for a resource management system (RMS) that can deliver,
whenever possible, the required quality of service (QoS) to
individual processes that are contending for the same set
of distributed, heterogeneous resources. Factors
influencing QoS requirements include security, user
preferences for different versions of an application, and
deadlines. A set of QoS requirements, considered together
with resource availability, determines whether all
processes’ requirements can be met.

An  RMS,  also  sometimes  called  a  meta-computing 
system,  is  similar  to  a  distributed  operating  system  in  that 
it  views  the  set  of  machines  that  it  manages  as  a  single 
virtual  machine  [50].  Also,  like  any  distributed  operating 
system,  it  attempts  to  give  the  user  a  location-transparent 
view  of  the  virtual  machine.  Hence,  as  in  the  case  of  a 
distributed  operating  system,  an  RMS  provides  users  with 
improved  performance  while  the  location  of  resources  is 
hidden.  The  set  of  users  of  a  system,  which  consists  of 
both  local  and  remote  resources,  that  is  managed  by  an 
RMS  should  be  able  to  attain  a  higher  level  of  availability 
and  more  fault  tolerance  than  would  be  available  from 
their  local  system  alone. 

An RMS differs from a distributed operating system in
that  it  does  not  micro-manage  the  resources  of  each 
computer.  Instead,  each  computer  runs  its  native 
operating  system.  Similarly,  each  router  executes  its  own 
protocol  and  each  file  server  executes  a  native  distributed 
file  system.  The  RMS  is  responsible  for  identifying  the 
large-grained  resources,  i.e.,  compute  servers  and  data 
repositories  that  should  be  used  by  each  process,  if  there  is 
a  choice.  It  may  be  responsible  for  issuing  a  command  to 
begin  execution  of  the  processes  that  comprise  an 
application.  It  may  monitor  the  status  of  both  the 
resources  in  the  system  and  the  progress  of  the 
applications  for  which  it  is  responsible. 

It  is  unclear  whether  every  request  to  execute  an 
application  that  is  submitted  to  any  operating  system  on 
any  of  the  machines  in  the  distributed  system  must  be 
controlled  by  the  RMS.  If  all  requests  are  controlled  by 
the  RMS,  then  allocation  policies  that  attempt  to  optimize 
throughput  for  a  set  of  well-understood  applications  will 
perform  better.  However,  sometimes  users  wish  to 
maintain  control  over  which  resources  their  application 
will  use. 

There  are  many  active,  on-going  research  projects,  in 
addition  to  MSHN,  in  the  area  of  resource  management, 
and  there  are  many  major  research  problems  to  be  solved 
[38].  A  problem  that  MSHN  is  not  addressing  is  the  best 
way  for  such  a  system  to  interact  with  human  users  to 
obtain  their  QoS  preferences  and  requirements  in  the  most 
user-friendly  way.  Indeed,  simply  identifying  the  syntax 
and  semantics  required  to  express  all  of  the  QoS 
preferences  and  requirements  is  a  difficult  problem 
[11][17][37][54].  While  MSHN  does  not  address  this 
problem,  the  designers  of  MSHN  expect  to  leverage 
results  from  research  in  this  area.  They  assume,  for 
example,  that  a  request  to  execute  an  application  is 
accompanied  by  a  list  of  deadlines,  preferences  for  various 
versions  of  an  application,  security  requirements,  and  any 
restrictions  on  the  variance  of  the  time  at  which  a  request 
should  be  completed. 

Before leaving the general topic of RMSs, it is
imperative that we address the topic of “packaging.”
MSHN  researchers  do  not  see  the  fruits  of  the  RMS 
research  as  a  large,  monolithic  piece  of  software  that  will 
require  its  own  separate  installation  and  maintenance.  The 
best  way  to  package  the  eventual  outcomes  of  the  RMS 
projects  may  be  to  incorporate  them  into  an  infrastructure- 
or  middleware-level  standard  similar  to  the  Common 
Object  Request  Broker  Architecture  (CORBA),  Domain 
Name  Services,  or  other  such  resource  location  services. 
In  this  way,  an  RMS  would  not  need  to  be  separately 
maintained  and  would  be  consolidated  with  the  services 
that  distributed  applications  will  most  often  use. 
However, it is still worthwhile to separate research on
RMSs from research in all other aspects of distributed
object computation that will be needed in future versions
of such standards in order to first isolate, then solve some
of the difficult resource management problems.

1.1.  Background 

MSHN evolved in part from a scheduling framework
called SmartNet [19][28]. SmartNet’s goal was to be able
to wisely schedule sets of compute-intensive jobs, some of
which may require the execution of multiple processes,
onto members of a suite of heterogeneous computers.
SmartNet provided a sophisticated scheduling module that
had been successfully integrated with many RMSs and
distributed computing environments. Hence, users who
need to execute compute-intensive jobs and have access to
a shared, heterogeneous environment can achieve superior
performance, while continuing to work in an environment
to which they have grown accustomed [21]. Additionally,
for those users who do not already have one installed,
SmartNet provided a basic RMS that makes use of its
sophisticated scheduling capabilities. SmartNet’s major
research contributions include:

•  The  ability  to  predict  the  expected  run-time  of  a  job 
on  a  machine  using  the  concept  of  compute 
characteristics  and  information  collected  from 
previous  executions  of  the  job. 

•  The  ability  to  leverage  the  heterogeneity  inherent 
in  both  a  collection  of  jobs  as  well  as  in  a 
collection  of  computers. 

SmartNet  was  used  successfully  by  DoD  and  the 
National  Institutes  of  Health  in  scheduling  their  compute¬ 
intensive  jobs,  and  by  NASA’s  EOSDIS  system  in 
determining  whether  their  resources  were  adequate  to 
process  data  in  the  ways  desired  by  their  scientists. 

SmartNet’s scheduling algorithms are tuned to attempt
to minimize the time at which the last job completes,
although the designers of SmartNet recognized that similar
algorithms may be useful in optimizing other criteria. Of
course, minimizing the time at which the last job, of a set
of jobs, completes is, in general, an NP-complete problem,
so SmartNet employs heuristics when it searches for a
near-optimal mapping of jobs to machines and job
execution schedule. Many of the heuristics that it uses are
well known and previously documented; however, they
had not previously been used in a practical heterogeneous
computing system [25]. It is likely that they were not
previously used in actual systems because system
designers had not tried to estimate average process
run-times and because it was not previously recognized that
exact run-times, though helpful, were not necessary
[2][3][26].
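
To make the mapping problem concrete, the sketch below shows one well-known greedy heuristic of the kind alluded to above: each job is assigned, in arrival order, to the machine on which it would complete earliest. It is an illustration only and is not claimed to be one of SmartNet’s heuristics. The function name greedy_mct and the matrix etc of expected run-times are assumptions introduced here.

```python
def greedy_mct(etc):
    """Greedy minimum-completion-time mapping.

    etc[j][m] is the expected run-time of job j on machine m.
    Returns (mapping, makespan), where mapping[j] is the machine chosen for job j.
    """
    if not etc:
        return [], 0.0
    ready = [0.0] * len(etc[0])   # time at which each machine becomes free
    mapping = []
    for times in etc:
        # Choose the machine that would finish this job earliest.
        m = min(range(len(ready)), key=lambda i: ready[i] + times[i])
        ready[m] += times[m]
        mapping.append(m)
    return mapping, max(ready)


# Three jobs, two heterogeneous machines (run-times in arbitrary units).
expected = [[4, 8],
            [6, 3],
            [5, 5]]
mapping, makespan = greedy_mct(expected)
print(mapping, makespan)
```

The heuristic uses only expected run-times, which matches the observation above that estimates, rather than exact run-times, are sufficient to drive such mappings.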

1.2.  Overview of MSHN’s goals

MSHN differs from SmartNet in three major ways.
First, SmartNet was expected, from the beginning, to be a
system that would actually be used in production. For this
reason, much of the SmartNet developers’ time was spent
ensuring that SmartNet was at SEI Level 3. Despite this,
SmartNet was able to make significant research
contributions. MSHN is intended to be a research system,
facilitating experiments by the investigators to determine
how RMSs that have somewhat broader goals than
SmartNet can be built. MSHN’s research goals expand
upon SmartNet’s in the following areas.

(i)  MSHN  needs  to  consider  that  the  overhead  of 
jobs  sharing  resources,  such  as  networks  and  file 
servers,  can  have  significant  impact  on  mapping 
and  scheduling  decisions. 

(ii)  MSHN  must  support  adaptive  applications 
(defined  below). 

(iii)  MSHN must deliver good QoS to many different
sets of simultaneous users, some of whom may
be executing interactive jobs; others,
compute-intensive jobs; and still others, jobs with
real-time requirements.

In SmartNet’s model, applications consist of three
distinct  phases.  In  the  first  phase,  which  is  short  compared 
to  the  second  phase,  they  acquire  data  from  a  data 
repository.  In  the  second  phase,  they  compute  results 
based  upon  the  data  that  they  obtained  during  the  first 
phase.  In  the  third  phase,  which  is  again  very  short 
compared  to  the  second  phase,  they  write  the  result  back  to 
a  possibly  different  repository.  Because  the  first  and  third 
phases are so short, SmartNet’s heuristics assume that
there  is  no  contention  for  either  the  network  or  the  data 
repositories.  However,  they  do  account  for  the  time 
required  to  access  the  resources,  assuming  that  each 
application  is  the  sole  user  of  those  resources.  The  model 
of  applications  that  MSHN  is  meant  to  manage  is  more 
complex,  permitting  applications  to  transition  through 
many  more  phases  of  variable  length,  each  requiring  not 
only  sharing  of  compute  resources,  but  also  sharing  of 
network  and  data  repository  resources.  We  discuss  briefly 
in  this  paper,  and  elaborate  elsewhere,  both  the  problem  of 
modeling  the  application  and  that  of  accounting  for  lower 
level  policies  that  govern  the  sharing  of  resources.  That  is, 
because  MSHN  does  not  assume  that  it  has  any  control 
over  network  routing,  file  server  memory  allocation,  etc., 
it  models,  when  necessary,  the  lower  level  operating 
systems  and  protocols.  By  doing  so,  the  assignment  of 
processes  to  resources  will  account  for  the  sharing  of  those 
resources  in  the  correct  way. 

The second major difference between SmartNet’s and
MSHN’s research goals is that MSHN attempts to provide
support for adaptive and adaptation-aware applications.
By adaptive applications, we mean idempotent
applications that can exist in several different versions.
Different versions may have different values to a user due
to factors such as precision of computation or input data.
Additionally, different versions may have different
communication and computation needs. Or, one version
may execute on Windows NT while another version is an
executable for Linux. MSHN’s goal is to support adaptive
applications by being able to terminate one version of an
application if MSHN perceives that the currently
executing version will not meet the users’ QoS
expectations.² In that case, MSHN would terminate the
executing version and start up another version from the
beginning (if there were sufficient resources to execute
that other version). The requirement that adaptive
applications be idempotent permits the application to be
safely restarted from the beginning without corrupting any
resource such as a database. Similarly, there may be times
when MSHN determines that delivery of a better QoS is
possible to a user by changing to a version that better
meets that user’s preferences.
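
The version-switching decision for an adaptive application can be illustrated with a toy selection rule. The sketch below is an assumption-laden example, not MSHN’s policy: it supposes each version carries a user-assigned value and a predicted run-time, and it picks the highest-valued version that could still finish by the deadline if started from the beginning now. The function name pick_version and the tuple layout are hypothetical.

```python
def pick_version(versions, now, deadline):
    """Return the name of the best version that can still meet the deadline.

    versions: list of (name, value_to_user, predicted_runtime) tuples.
    A restart from the beginning is assumed, as for idempotent applications.
    Returns None when no version can finish in time.
    """
    feasible = [v for v in versions if now + v[2] <= deadline]
    if not feasible:
        return None
    return max(feasible, key=lambda v: v[1])[0]


versions = [("high-precision", 10, 50.0),
            ("low-precision", 4, 20.0)]

# Early on, the more valuable version is still feasible; later, only the
# cheaper version can be restarted and still finish by the deadline.
print(pick_version(versions, now=0.0, deadline=60.0))
print(pick_version(versions, now=35.0, deadline=60.0))
```

A real decision would also have to check that sufficient resources exist for the chosen version and weigh the effect on other users’ applications, as the text notes.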

An  adaptation-aware  application  differs  from  an 
adaptive  application  in  two  ways.  First,  when  it  is 
terminated,  the  new  version  need  not  be  restarted  from  the 
beginning.  Instead,  a  different  version  from  the  one  that 
terminated  may  be  started,  using  information  about  a 
previous state that was obtained from the execution of the
previous  version.  Second,  an  adaptation-aware  application 
may  be  able  to  adapt  its  resource  usage  during  execution, 
without  restarting. 

Finally, MSHN’s goals differ from SmartNet’s in that
MSHN seeks to determine how to meet multiple different
QoS requirements of multiple different applications
simultaneously. There are really two issues bound up in
this difference. First, a way to incorporate, dynamically,
the mixture of QoS requirements into a single measure
must be determined. Second, an assignment of
applications to resources must also be determined that
optimizes the identified measure. In resolving this second
issue, we can strongly leverage SmartNet’s emphasis on
the separation of optimization criteria and search
algorithms and the recognition that similar algorithms can
be used to search many different types of spaces for
optimal values. We elaborate on this below.

1.3.  Related work

There are other research groups examining the issues
important to building an RMS, many within DARPA’s
Quorum project. Here, we look at some of the projects
related  to  MSHN.  Some  of  these  groups  are  engaged  in 
research  complementary  to  MSHN’s  goals.  For  the  sake 
of  brevity,  only  a  short  synopsis  of  each  project,  as  it 
relates  to  MSHN,  is  presented. 

DeSiDeRaTa. The University of Texas at Arlington
has a project called “DeSiDeRaTa: QoS Management
Tools for Dynamic, Scalable, Dependable, Real-Time
Systems.” DeSiDeRaTa is focusing on QoS specification,
QoS metrics, dynamic QoS management, and
benchmarking of specific computing environments, such
as the distributed Anti-Air-Warfare system at the Naval
Surface Warfare Center, Dahlgren Division. A unique
concept that has come out of the DeSiDeRaTa project is
that of an application “path” [56].

² We note that a version of one application may be terminated because
MSHN detects that another user’s application will not meet its QoS
expectations. This phenomenon can occur due to priorities.

Globus. Globus is a large, joint project from Argonne
National Laboratory and the University of Southern
California’s Information Sciences Institute. Parts of the
Globus project are devoted toward resource management
issues. The Globus architecture depends on an advance or
immediate resource reservation protocol layer, for which a
standard does not yet exist [14][18].

RT-ARM. Honeywell is developing a “Real-Time
Adaptive Resource Management” system aimed primarily
at high-end, real-time military embedded systems such as
the Navy Surface Combatant Ship SC-21. Some of the
specific issues they are concentrating on include modeling
embedded systems and finding practical techniques for
predictable real-time performance [24].

EPIQ. The EPIQ project, from the University of
Illinois at Urbana-Champaign, is building an infrastructure
for providing guaranteed QoS features, upon which RMSs
may be built. Part of their infrastructure involves building
their own runtime environment [33].

ERDoS. SRI International is running a project called
ERDoS (End to End Resource Management for
Distributed Systems), which is developing an architecture
for adaptive QoS-driven resource management. The
ERDoS project emphasizes a comprehensive definition of
QoS and the development of models that capture
information required for making resource management
decisions [46].

QUASAR.  The  QUASAR  (QUAlity  Specification  and 
Adaptive  Resource  management  for  distributed  systems) 
project,  at  the  Oregon  Graduate  Institute  of  Science  and 
Technology,  is  investigating  techniques  for  specifying  and 
utilizing  QoS  in  adaptive,  distributed  systems.  QUASAR 
is  concentrating  on  the  translation  of  QoS  specifications 
from  the  application-level  to  the  resource-management- 
level,  and  its  use  in  reservation-based  resource 
management, primarily in the multimedia domain [51].

ASSERT.  The  ASSERT  System  at  the  University  of 
Oregon, Eugene, is focusing on dynamic, distributed, real-time
environments. The core of the project estimates and
monitors  the  relevant  QoS  parameters  of  running 
applications.  ASSERT  is  not  an  RMS,  nor  an  RMS 
framework;  rather,  the  ASSERT  project  is  looking  at  a 
specific  issue  of  RMSs:  QoS  monitoring  and  estimation 
[15]. 

QuO. The Quality Objects (QuO) project, from BBN
Systems  Technologies,  is  attempting  to  add  QoS 
specification  and  delivery  to  CORBA.  Rather  than 
provide absolute QoS guarantees, QuO seeks to combine
knowledge about resource and application conditions in
order to reserve enough end-to-end resources for
predictable execution of distributed applications [47].

MOL.  The  MOL  (Metacomputing  OnLine)  project 
from the Paderborn Center for Parallel Computing has as a
goal  the  utilization  of  multiple  high  performance  systems 
for  solving  problems  too  large  for  a  single  supercomputer. 
The  MOL  approach  does  not  assume  absolute  control  of 
resources  under  its  management.  The  MOL  project  is 
addressing  several  of  the  issues  key  to  resource 
management, including QoS specification [40].

1.4.  Organization  of  the  paper 

In  the  next  section  of  the  paper  we  motivate  and  discuss 
MSHN's architecture. Even though SmartNet was
successful in achieving its functionality, rather than using
SmartNet's architecture exactly, we based MSHN's
architecture upon lessons learned from SmartNet, because
MSHN's goals are substantially different. In particular,
we clearly delineated certain of SmartNet's modules into
separate  components.  This  delineation  makes  it  easier  to 
experiment  with  different  designs  for  each  of  the 
components.  In  section  3,  we  then  discuss  many  of  the 
research  issues  that  the  MSHN  investigators  are  studying 
and  highlight  some  of  the  results.  Additionally,  this 
section  provides  references  to  the  numerous  articles  that 
describe  this  research  in  more  detail.  We  conclude  by 
summarizing  the  status  of  the  MSHN  project. 

2. MSHN's architecture

In  this  section,  we  first  describe  some  of  the  concepts 
that went into MSHN's architectural design. This
description  motivates  the  need  for  the  various  major 
components  and  explains  why  they  must  be  replicated  to 
varying  degrees.  The  architectural  design  was  driven  by 
the  need  to  support  the  RMS  research  that  we  will  discuss 
in  the  next  section  and  was  aided  by  our  previous 
experience with SmartNet. We then present MSHN's
current  architecture  in  detail. 

2.1.  Motivation 

We  first  motivate  the  need  for  each  of  the  major 
components of MSHN's architecture, then discuss how
those  components  interact  with  one  another. 

We  recall  from  the  previous  section  that  an  RMS  needs 
to  transparently  locate  the  resources  that  should  be  used 
when  execution  of  an  application  is  requested.  Therefore, 
it  must  be  made  aware  of  any  request,  by  either  a  user  or 
an  application,  to  start  executing  another  application. 
Many  early  RMSs  required  the  user  to  explicitly  log  in  to 
the  system  to  start  a  job.  If  an  application  was  to  be 
started  from  within  another  application,  e.g.,  through 
fork and exec system calls, then the application that
makes  the  request  would  be  required  to  be  specially 
designed  to  embed  these  requests  within  a  function  call  to 
an  RMS  library.  This  restriction  required  that  applications 
be  specifically  written  or  modified  for  a  particular  RMS. 

The  MSHN  designers  do  not  want  to  force  a  user  to 
explicitly  log  into  an  RMS,  or  to  modify  their  existing 
programs.  Instead,  MSHN  transparently  intercepts  calls  to 
system  libraries  that  would  otherwise  initiate  execution  of 
a  new  process  and  diverts  those  calls  to  a  MSHN  Client 
Library.  After  MSHN  decides  where  the  newly  requested 
application  should  execute,  the  MSHN  Client  Library  uses 
whatever mechanisms are available at the resource site to
initiate  execution  of  the  remote  process. 

The  environments  for  which  MSHN  is  designed  contain 
many  different  types  of  computers,  each  possibly 
executing  a  different  version  of  an  operating  system. 
Rather  than  requiring  the  Client  Library,  which  is  linked 
with  every  MSHN  application,  to  contain  a  substantial 
amount  of  code  that  is  specific  to  each  of  these  computers, 
we  chose  to  make  use  of  a  MSHN  Daemon.  Whenever  a 
computer  is  added  to  a  system,  a  MSHN  Daemon  is  started 
on  that  computer.  When  a  Client  Library  needs  to  start  a 
process  on  a  remote  machine,  it  simply  contacts  the 
MSHN  Daemon  on  that  machine  and  requests  that  the 
Daemon  start  the  process  on  the  Client  Library’s  behalf. 
Of  course,  the  general  mechanism  that  we  use  in  the 
Daemon  is  not  new,  and  is  therefore  not  a  research  issue. 

When  a  remote  process  needs  to  communicate  with  the 
initiating  process,  it  contacts  the  Client  Library,  which 
passes  the  information  on  to  the  initiating  process,  just  as 
though  the  remote  process  were  started  locally.  Being 
able  to  transparently  provide  this  service  to  applications, 
whether  or  not  they  are  command  interpreters,  requires 
that  the  Client  Library  intercept,  and  at  least  pre-process  if 
not  divert,  other  system  library  calls  in  addition  to  the 
previously  mentioned  exec  call.  For  example,  all  of  the 
socket  calls  and  all  calls  to  open,  close,  read,  and  write 
files  must  be  intercepted  and  replaced  or  at  least  pre-  and 
post-processed. 

The  MSHN  project  required  a  mechanism  for 
intercepting  these  calls  without  requiring  source 
modification.  We  initially  turned  to  the  Condor  project  for 
help  with  this  problem  [36].  Condor  is  a  project  at  the 
University  of  Wisconsin  that  performs  transparent 
migration  of  processes  in  a  Unix  environment.  To 
perform this migration, Condor also had to intercept these
calls to system libraries. Using techniques similar to those
used by Condor, we were able to intercept these calls
without requiring source code modification.³ The
mechanism is described in detail elsewhere [43].


³ These techniques, however, require that the object code files be linked
with  the  MSHN  Client  Library,  therefore  they  require  object  code  files. 
However,  another  tool,  the  Executable  Editing  Library  (EEL)  which 


In  addition  to  providing  a  mechanism  for  transparently 
executing  remote  processes,  the  Client  Library  is  in  a 
unique  position  to  passively  determine  the  status  of 
resources,  because  it  is  assumed  to  be  linked  with  any 
application  executing  in  an  environment  managed  by 
MSHN.  That  is,  the  MSHN  Client  Library  can  pre-  and 
post-process  system  calls,  because  it  is  intercepting  all 
such  calls  made  to  the  operating  system,  which  are 
executed  when  a  process  needs  to  use  a  hardware  resource. 
In  so  doing,  it  can  determine  the  low  level,  end-to-end 
QoS  that  an  application  is  receiving  from  a  particular 
resource.  We  will  discuss  this  functionality  of  the  Client 
Library  further  in  the  next  section. 

When  the  MSHN  Client  Library  intercepts  a  call  to 
execute  a  new  process,  it  must  have  some  way  of 
determining  which  resources  that  new  process  should  use, 
i.e.,  which  computer  should  primarily  be  responsible  for 
executing the new process.⁴ Rather than requiring that
decision  to  be  made  independently  by  each  Client  Library 
that  is  linked  with  each  application,  we  chose  to  have  the 
Client  Library  first  check  the  request  against  a  list  of 
applications  managed  by  MSHN.  If  the  requested 
application  is  not  on  that  list,  the  MSHN  Client  Library 
simply  passes  the  requested  application  directly  to  the 
local  operating  system.  If  the  requested  application  is  on 
that  list,  it  instead  passes  the  request  to  the  MSHN 
Scheduling Advisor. It is the Scheduling Advisor's job to
determine  which  set  of  resources  the  newly  requested 
process  should  use. 
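
The dispatch decision described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `ClientLibrarySketch`, its fields, and the string launch records are hypothetical and are not part of MSHN; the real Client Library intercepts system library calls rather than exposing a Python method.

```python
from dataclasses import dataclass
from typing import Callable, Set


@dataclass
class ClientLibrarySketch:
    """Hypothetical stand-in for the exec-interception path of the Client Library."""
    managed_apps: Set[str]                       # applications managed by MSHN
    ask_advisor: Callable[[str], str]            # app -> chosen host (Scheduling Advisor)
    start_via_daemon: Callable[[str, str], str]  # (host, app) -> launch record
    start_locally: Callable[[str], str]          # app -> launch record

    def on_exec(self, app: str) -> str:
        # Applications that are not on the managed list go straight
        # to the local operating system.
        if app not in self.managed_apps:
            return self.start_locally(app)
        # Managed applications are placed by the Scheduling Advisor and
        # then started by the Daemon on the chosen machine.
        host = self.ask_advisor(app)
        return self.start_via_daemon(host, app)


# Example wiring with stub callables (all names are illustrative).
cl = ClientLibrarySketch(
    managed_apps={"render"},
    ask_advisor=lambda app: "hostB",
    start_via_daemon=lambda host, app: f"daemon@{host}:{app}",
    start_locally=lambda app: f"local:{app}",
)
```

Keeping the list check in the Client Library means that unmanaged programs never incur a round trip to the Scheduling Advisor.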

The  MSHN  Scheduling  Advisor  is  itself  a  complex 
package,  associated  with  many  different  research  issues 
which  we  discuss  more  fully  in  the  next  section.  Among 
the  primary  research  issues  are:  (i)  what  criteria  should  be 
optimized  in  the  choice  of  resources?  (ii)  Because 
optimizing  the  criteria  is  likely  to  be  an  NP-complete 
problem,  if  n  is  too  large,  which  heuristic  should  be  used 
to  search  for  an  optimum  resource  assignment?  (iii)  With 
what  granularity  must  the  Scheduling  Advisor  model  both 
the  policies  and  protocols  associated  with  allocation  of  the 
lower  level  resources  and  what  granularity  of  model 
should  it  use  to  define  the  resource  requirements  of  a 
process? 

For  the  Scheduling  Advisor  to  determine  a  good 
assignment  of  resources  for  a  process,  it  must  know  both 
which  resources  and  how  much  of  each  resource  would  be 
required  for  a  process  to  execute  and  meet  its  QoS 
requirements  and  preferences.  Therefore,  to  assist  the 
Scheduling  Advisor  in  making  its  decision  as  to  the 


evolved from the University of Wisconsin’s Paradyn project could be
used to link an executable with the MSHN Client Library, instead [31].
⁴ In modern systems, the choice of computer that is responsible for
executing  a  process  often  carries  with  it,  implicitly,  a  choice  of  file 
servers  and  other  distributed  resources  such  as  networks.  Therefore, 
when  we  say  that  MSHN  chooses  a  computer  to  be  responsible  for 
executing  a  process,  the  choice  of  other  resources  external  to  that 
computer  may  be  implicit  in  that  assignment. 
assignment of resources, we designed both the MSHN
Resource  Requirements  Database  and  the  MSHN 
Resource  Status  Server. 

The  Resource  Status  Server  is  a  quickly  changing 
repository  that  maintains  information  concerning  the 
current  availability  of  resources.  Information  is  stored  in 
the  Resource  Status  Server  as  a  result  of  updates  from  both 
the  MSHN  Client  Library  and  the  MSHN  Scheduling 
Advisor.  The  Client  Library  can  update  the  Resource 
Status  Server  as  to  the  currently  perceived  status  of 
resources,  which  takes  into  account  resource  loads  due  to 
processes  other  than  those  managed  by  MSHN.  The 
Scheduling  Advisor  can  provide  expected  future  resource 
status  based  upon  the  resources  that  it  expects  will  be  used 
by  the  applications  that  it  assigns.  Additionally,  the 
Resource  Status  Server  can  statistically  process  its  historic 
knowledge  to  make  predictions  of  resource  status  even 
further  in  the  future. 
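
The text does not say which statistical method the Resource Status Server applies to its history. As one plausible illustration only, an exponentially weighted moving average gives a simple forecast from past load samples; the function name `ewma_forecast` and the smoothing parameter `alpha` are assumptions made for this sketch.

```python
def ewma_forecast(samples, alpha=0.5):
    """Forecast the next load value from past samples (oldest first).

    Assumed technique: exponentially weighted moving average. A larger
    alpha weights recent samples more heavily.
    """
    if not samples:
        raise ValueError("need at least one sample")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    estimate = samples[0]
    for x in samples[1:]:
        # Blend the newest observation with the running estimate.
        estimate = alpha * x + (1.0 - alpha) * estimate
    return estimate
```

A server of this kind could keep one running estimate per resource and update it on every report, so the full history need not be stored.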

As  compared  to  the  Resource  Status  Server,  the 
information  maintained  by  the  MSHN  Resource 
Requirements  Database  changes  much  more  slowly.  The 
Resource  Requirements  Database  is  responsible  for 
maintaining  information  about  the  resources  that  are 
required  to  execute  a  particular  application.  Although  the 
initial  MSHN  prototype  only  implements  a  single  source 
for  the  information  stored  in  this  database  (statistically 
analyzed  historical  information),  we  envision  that  many 
other  on-going  research  projects  will  also  serve  as  sources 
for  this  information. 

MSHN’s  current  source  for  the  information  that  is 
maintained  by  the  Resource  Requirements  Database 
comes  from  data  collected  by  the  MSHN  Client  Library 
when  the  application  was  previously  executed.  Although 
patterned  after  SmartNet  in  this  way,  and  leveraging  the 
concept  of  compute  characteristics  that  SmartNet 
pioneered,  MSHN  does  not  collect  the  same  information 
as SmartNet collects. SmartNet's information is coarse-grained;
that is, it maintains only the total amount of wall-clock
time that is required to execute a program from
beginning  to  end  for  each  particular  machine.  This 
measure  is  sufficient  for  SmartNet’s  needs  due  to  the 
requirements  of  its  intended  applications  (three  phases) 
and  the  expected  environment  (each  job  has  exclusive 
access  to  the  resources  that  it  is  using).  However,  in 
MSHN,  resources  are  shared  and  applications  have  more 
phases, so maintaining only this coarse-grained information
is insufficient. Therefore, the Resource Requirements
Database has the ability to maintain very fine-grained
information  collected  by  the  MSHN  Client  Library. 
Eventually  it  is  hoped  that  the  Resource  Requirements 
Database  can  also  be  populated  with  information  from 
smart  compilers  and  possibly  advice  from  application 
writers. 

Applications,  of  course,  are  needed  to  test  any  system. 
Unfortunately, executables for many different platforms
would be needed to test MSHN's ability to manage them
in  a  distributed,  heterogeneous  environment.  Producing 
such  actual  applications  would  require  tremendous  effort 
to  obtain  the  source  code  for  numerous  applications,  some 
of  which  may  be  classified  or  proprietary,  port  the  source 
code  to  the  different  platforms,  and  compile  and  link  them. 
We  decided  that  this  effort  was  better  spent  on  our 
research  system  itself,  so  we  looked  for  another  viable 
solution.  One  solution  that  we  considered  was  to  use 
benchmarks,  because  many  of  them  have  already  been 
ported  to  many  different  platforms.  However,  we  wanted 
to  make  sure  that  our  system  could  manage  a  wide  variety 
of  applications.  We  finally  settled  on  writing  a  general- 
purpose  application  emulator  whose  parameters  could  be 
specified  to  cause  it  to  imitate  a  wide  variety  of 
applications.  We  discuss  the  problem  of  deciding  how 
best  to  construct  such  an  emulator  under  the  research 
topics  in  the  next  section. 

The  Client  Library,  which  is  linked  with  each  executing 
MSHN  application,  informs  the  Resource  Status  Server 
about  the  current  perceived  status  of  the  resources  that  the 
applications  are  using.  The  Scheduling  Advisor  informs 
the  Resource  Status  Server  only  about  the  load  that  it 
expects  the  processes,  which  it  has  scheduled,  to  place  on 
certain  resources.  However,  neither  class  of  information 
indicates  the  condition  of  resources  that  no  MSHN 
application  is  currently  using  or  is  planning  on  using. 
Therefore,  we  use  a  MSHN  Application  Emulator  linked 
with  the  Client  Library  to  obtain  information  about  the 
condition  of  such  resources. 


Figure 1. MSHN's conceptual architecture.

MSHN's conceptual architecture is shown in Figure 1.
As  can  be  seen  in  the  figure,  every  application  running 
with  MSHN  makes  use  of  the  MSHN  Client  Library  that 
intercepts  the  application’s  operating  system  calls.  When 
the  Client  Library  intercepts  a  request  to  execute  a  new 
application,  and  that  application  requires  that  the  MSHN 
Scheduling  Advisor  be  consulted  to  determine  the 
resources  that  the  application  should  use,  the  Client 
Library invokes a scheduling request on the Scheduling
Advisor.  The  Scheduling  Advisor  queries  both  the 
Resource  Requirements  Database  and  the  Resource  Status 
Server.  It  uses  information  that  it  receives  from  them, 
along  with  an  appropriate  search  heuristic,  to  determine 
where  the  newly  requested  process  should  execute.  After 
determining  which  resources  should  host  the  new  process, 
the  Scheduling  Advisor  returns  the  decision  to  the  Client 
Library,  which,  in  turn,  requests  execution  of  that  process 
through  the  appropriate  MSHN  Daemon.  The  MSHN 
Daemon  invokes  the  application  on  its  machine.  As  a 
process  executes,  the  Client  Library  updates  both  the 
Resource  Status  Server  and  the  Resource  Requirements 
Database  with  the  current  status  of  the  resources  and  the 
requirements  of  the  process.  Meanwhile,  the  Scheduling 
Advisor  establishes  callbacks  with  both  the  Resource 
Requirements  Database  and  the  Resource  Status  Server. 
Using  callbacks,  the  Scheduling  Advisor  is  notified  in  the 
event  that  either  the  status  of  the  resources  has 
significantly  changed,  or  the  actual  resource  requirements 
are  substantially  different  from  what  was  initially  returned 
from  the  Resource  Requirements  Database.  In  either  case, 
if  it  no  longer  appears  that  the  assigned  resources  can 
deliver  the  required  QoS,  the  application  must  be  adapted 
or  terminated.  Upon  receipt  of  a  callback,  the  Scheduling 
Advisor  might  require  that  several  of  the  applications 
adapt  so  that  more  of  them  can  receive  their  requested  or 
desired  QoS. 
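
The callback mechanism can be illustrated with a small sketch. It assumes that a "significant" change means a load difference of at least a registered threshold between consecutive reports; the class `StatusServerSketch` and its methods are hypothetical names, not MSHN interfaces.

```python
class StatusServerSketch:
    """Hypothetical status repository that fires threshold callbacks."""

    def __init__(self):
        self._load = {}       # resource name -> last reported load
        self._callbacks = []  # (resource, threshold, function) triples

    def register_callback(self, resource, threshold, fn):
        # In the architecture above, the Scheduling Advisor would register these.
        self._callbacks.append((resource, threshold, fn))

    def update(self, resource, load):
        # In the architecture above, the Client Library would report loads.
        # The first report for a resource has no previous value to compare with.
        previous = self._load.get(resource, load)
        self._load[resource] = load
        for res, threshold, fn in self._callbacks:
            if res == resource and abs(load - previous) >= threshold:
                fn(resource, previous, load)
```

With thresholds, the advisor is contacted only when a change is large enough to matter, rather than on every status report.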

Figure 2. Physical instantiation of the MSHN architecture.

Although  all  MSHN  components  could  run  on  the  same 
machine,  they  can  also  be  distributed  and  replicated  across 
many  different  computers  using  tools  such  as  ISIS,  Horus 
and  Ensemble  [7][49][48].  Results  from  control  theory 
will  also  be  useful  here  in  ensuring  that  the  process  of 
replicating  and  merging  components  is  stable  and  does  not 
result  in  oscillation.  Additionally,  results  from  control 
theory  must  be  incorporated  into  the  replicated  Scheduling 
Advisor itself to ensure that modifications requested of
adaptive and adaptation-aware applications do not become
unstable. MSHN components might even replicate as
needed [20][21]. Figure 2 illustrates a simple
instantiation of the MSHN system.

In  addition  to  the  components  discussed  above,  we 
found  it  convenient  to  add  a  MSHN  Visualizer  that 
enabled  us  to  examine,  for  both  functional  and 
performance  debugging  purposes,  the  current  states  of  the 
various  MSHN  components.  The  MSHN  Visualizer 
captures  all  significant  events  within  and  between  the  core 
MSHN  components  for  real-time  and  post-mortem 
analysis. 

Security  within  the  MSHN  architecture  has  been 
considered.  Policies  of  interest  are: 

•  Component  authentication.  This  includes 
authentication  of  MSHN  core  components  to  each 
other;  authentication  of  resource-based  clients  to 
the  MSHN  core;  and  authentication  of  applications 
to  selected  MSHN  components. 

•  Hierarchical  least  privilege.  Within  the  MSHN 
context,  the  core  components  are  the  most 
privileged,  while  user  applications  are  the  least 
privileged. 

•  Communications  integrity  and  confidentiality. 
Communications  are  protected  from  unauthorized 
modification  and  disclosure. 

•  Access  control.  Access  to  MSHN  core  databases 
and  to  job  histories  may  be  mediated. 

The  security  architecture  creates  keyed  domains, 
supporting  least  privilege,  authentication,  confidentiality 
and  integrity  by  using  the  Common  Data  Security 
Architecture  facilities  for  security  services  and  key 
management  [57][58]. 

2.2.1.  The  current  MSHN  architecture.  A  high  level 
description  of  the  current  MSHN  architecture  is  presented. 
For  a  more  detailed  description,  we  refer  the  reader  to 
other  publications  [43].  High-level  diagrams  are  presented 
for  each  MSHN  component,  with  arrows  indicating  the 
direction  of  communication  or  action.  In  addition  to  these 
diagrams, a short description of each component's
functions  is  given.  In  the  description  of  the  MSHN 
architecture,  we  represent  MSHN  components  and 
external  components  as  Unified  Modeling  Language 
(UML)  actors  [8].  The  symbols  used  for  this 
representation  are  shown  in  Figure  3.  The  core  MSHN 
components  include  the  Scheduling  Advisor  (SA),  the 
Client  Library  (CL),  the  Resource  Status  Server  (RSS),  the 
Resource  Requirements  Database  (RRD),  the  Daemon  (D) 
and  the  Application  Emulator  (AE). 


⁵ As in any RMS, assurance of MSHN's security properties is built on and
limited  by  the  effectiveness  of  the  security  environment  provided  by  the 
underlying  operating  system(s)  and  hardware  base(s). 

Figure 3. Symbols representing actors in the MSHN architecture.

Scheduling Advisor (SA) functionality. The primary
responsibility  of  the  SA  is  to  determine  the  best 
assignment  of  resources  to  a  set  of  applications,  based  on 
the  optimization  of  a  global  measure,  which  we  describe  in 
the  next  section.  The  SA  depends  on  the  RRD  and  the 
RSS  in  order  to  identify  an  operating  point  that  optimizes 
the  global  measure.  It  responds  to  resource  assignment 
requests  from  the  CL.  When  appropriate,  the  SA  requests 
application  adaptations  via  the  CL.  The  SA  is  also 
responsible  for  establishing  callback  criteria  (thresholds) 
with  the  RSS  and  RRD.  All  MSHN  components  update 
the MSHN Visualizer with all significant display and post-mortem
analysis events.


Client  Library  (CL)  functionality.  The  CL  is  linked 
with  both  adaptive  and  adaptation-aware  applications.  It 
provides  a  transparent  interface  to  all  of  the  other  MSHN 
components.  The  CL  intercepts  system  calls  to  collect 
resource  usage  and  status  information,  which  is  forwarded 
to  the  RRD  and  the  RSS.  The  CL  also  intercepts  calls  that 
initiate new processes (such as exec()) and consults the
SA for the best place to start that process. It requests
(possibly remote) daemons to execute applications based
on the SA's advice. The CL invokes adaptation on
adaptation-aware  applications  when  notified  by  the  SA  via 
callbacks.  One  such  invocation  is  the  special  case  of 
setting  emulator  parameters. 


Resource  Status  Server  (RSS)  functionality.  The  role 
of  the  RSS  is  to  maintain  a  repository  of  the  three  types  of 
information  about  the  resources  available  to  MSHN: 
relatively  static  (long-term),  moderately  dynamic 
(medium-term), and highly dynamic (short-term)
information.  The  RSS  is  updated  with  current  data  via  the 
CL  or  through  a  system  administrator.  The  RSS  responds 
to  SA  requests  with  estimates  of  currently  available 
resources.  The  SA  sets  up  callbacks  with  the  RSS  based 
on  resource  availability  thresholds  and  CL  update 
frequency  requirements. 


Resource Requirements Database (RRD) functionality. The RRD is a repository of information
pertaining  to  the  resource  usage  of  applications.  The  RRD 
provides  this  information  to  the  SA.  Callbacks  to  the  SA 
are  based  on  either  the  occurrence  of  a  threshold  violation 
or  update  frequency  requirements.  It  is  updated  by  the 
CL. 


Daemon (D) functionality. The MSHN Daemon
executes  on  all  compute  resources  available  for  use  by  the 
SA.  Its  sole  purpose  is  to  start  applications  as  requested 
by  the  CL.  It  therefore  has  the  capability  and 
responsibility  of  initiating  the  default  application  emulator 
at  start-up  to  determine  resource  status  information. 

Application Emulator (AE) functionality. The AE
emulates  a  running  application  by  stressing  particular 
resources  in  the  same  way  as  the  real  application  does. 
The  AE  serves  two  purposes:  The  first  is  to  run  simulated 
applications  (that  statistically  leave  the  same  resource 
usage  footprint  of  the  real  applications)  without  the 
overhead  and  uncertainty  of  actually  installing, 
maintaining,  and  running  that  particular  application.  The 
second  is  to  be  a  monitor,  in  the  absence  of  any  other 
MSHN-scheduled  applications.  That  is,  it  can  determine 
the  status  of  resources  that  are  not  being  otherwise  used  by 
MSHN-scheduled  applications,  and  therefore  not  being 
monitored  by  an  existing  CL.  The  Daemon  starts  one 
instance  of  the  AE,  by  default,  at  startup.  Other  instances 
may  be  started  at  any  other  time  through  a  command 
interpreter  or  other  application. 


3.  MSHN  Research  Issues 

In  this  section  of  the  paper,  we  describe  some  of  the 
major  issues  being  investigated  by  the  MSHN  team 
members.  We  also  briefly  summarize  some  of  the  results 
to  date.  Of  course,  there  is  not  sufficient  space  to 
completely  describe  all  of  the  issues  and  results  in  detail, 
so  the  reader  is  also  referred  to  relevant  papers  on  each 
topic.  We  have  attempted  to  associate  the  issues  with  the 
component  of  the  MSHN  architecture  that  they  most 
strongly  affect.  However,  certainly  many  issues  that  affect 
the  Scheduling  Advisor  also  affect  the  Resource  Status 
Server  and  Resource  Requirements  Database. 
Additionally,  this  work  is  non-orthogonal  to  research 
being  done  by  many  investigators  outside  of  the  MSHN 
team  who  are  examining  such  issues  as  how  QoS 
requirements  are  derived  from  smart  compilers  and  how 
they  can  be  best  expressed. 

3.1.  Scheduling  Advisor  research  issues 

In  this  section  we  discuss  some  issues  that  most 
strongly  affect  the  Scheduling  Advisor.  First,  we  examine 
how  to  quantify  the  needs  of  all  of  the  processes  that 
require  resource  allocation  by  the  Scheduling  Advisor. 


Then,  we  consider  the  ramifications  of  not  precisely 
knowing  the  resource  requirements,  and  consequently,  the 
exact  future  status  of  all  of  the  resources.  Finally,  we 
discuss  the  class  of  heuristics  that  have  thus  far  been 
implemented  in  MSHN  and  why  there  is  a  need  for  a 
variety  of  heuristics. 

3.1.1.  Optimization  criteria.  Optimal  resource 
allocation  always  involves  attempting  to  solve  an 
optimization  problem,  which  is  usually  NP-complete. 
SmartNet's primary optimization criterion was to
minimize  the  time  at  which  an  application  completes, 
assuming  that  all  of  the  applications  were  of  a  particular 
form.  Later  versions  of  SmartNet  also  accounted  for 
priorities.  MSHN  maximizes  a  weighted  sum  of  values 
that  represents  the  benefits  and  costs  of  delivering  the 
required and desired QoS (including security, priorities,
and  preferences  for  versions),  within  the  specified 
deadlines,  if  any.  We  now  discuss  the  effect  of  each  of 
these  attributes  on  the  optimization  criteria. 

•  MSHN's consideration of security as an
optimization  criterion  allows  the  trade-off  of 
security  with  other  QoS  constraints  when  there  are 
insufficient  resources  to  complete  all  requests. 
This  is  done  in  a  fashion  similar  to  other  recent 
projects  [45].  MSHN  associates  a  cost  to  security 
levels  that  varies,  depending  upon  which  resources 
are  being  used  to  obtain  a  given  level  of  security 
(for  more  details  on  security  viewed  as  a  QoS 
parameter,  see  section  3.2). 

•  MSHN  attempts  to  account  for  both  preferences  for 
various  versions  and  priorities.  That  is,  when  it  is 
impossible  to  deliver  all  of  the  most  preferred 
information  within  the  specified  deadlines  due  to 
insufficient resources, MSHN's optimization
criteria  are  designed  to  favor  delivering  the  most 
preferred  version  to  the  highest  priority 
applications. 

•  In  MSHN’s  optimization  criteria,  deadlines  can  be 
simple  or  complex.  That  is,  sometimes  a  user  will 
be  satisfied  if  a  result  is  received  before  a  specific 
time.  Other  times,  a  user  would  like  to  associate  a 
more  general  benefit  function,  which  would 
indicate  that  information  might  have  different 
values  based  upon  when  it  is  received. 

Further information about MSHN's optimization
criteria can be found elsewhere [23][30].
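
To make the weighted-sum idea concrete, the sketch below shows one assumed way that priorities, version preferences, deadline-style benefit functions, and a security cost could be combined. The exact form used by MSHN is given in the cited papers and is not reproduced here; every name and the multiplicative combination are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Request:
    priority_weight: float             # larger means more important (assumed scale)
    version_preference: float          # 1.0 for the most preferred version
    benefit: Callable[[float], float]  # value of a result delivered at time t
    security_cost: float = 0.0         # assumed cost of the chosen security level


def step_deadline(deadline):
    """Simple deadline: full value on or before the deadline, none after."""
    return lambda t: 1.0 if t <= deadline else 0.0


def linear_decay(full_until, zero_at):
    """More general benefit: value falls linearly between two times."""
    def benefit(t):
        if t <= full_until:
            return 1.0
        if t >= zero_at:
            return 0.0
        return (zero_at - t) / (zero_at - full_until)
    return benefit


def objective(requests, completion_times):
    """Weighted sum of benefits minus security costs (illustrative form only)."""
    total = 0.0
    for r, t in zip(requests, completion_times):
        total += r.priority_weight * r.version_preference * r.benefit(t)
        total -= r.security_cost
    return total
```

A scheduler would evaluate `objective` for each candidate assignment and keep the assignment with the largest value.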

In  addition  to  a  cost  function  that  is  optimized, 
optimization  problems  usually  have  a  set  of  constraints 
that  must  be  met  in  order  for  a  solution  to  be  viable.  The 
constraints  of  a  resource  allocation  optimization  problem 
are  that  the  resources  allocated  to  meet  the  needs  of  the 
processes  must  be  less  than  or  equal  to  the  available 
resources  at  any  point  in  time.  The  actual  inequalities 
required  not  only  depend  upon  the  QoS  constraints,  but 
also upon the sharing policies used by the local operating
systems and network protocols, and upon the granularity
with which both those policies and resource usage should
be known (see Granularity Issues in Section 3.2).
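
In its simplest form, the capacity constraint described above can be written as follows. The symbols are introduced here for illustration and do not appear in the original: $P$ is the set of processes, $R$ the set of resources, $u_{p,r}(t)$ the amount of resource $r$ allocated to process $p$ at time $t$, and $c_r(t)$ the amount of resource $r$ available at time $t$.

```latex
\[
  \sum_{p \in P} u_{p,r}(t) \;\le\; c_r(t)
  \qquad \text{for every resource } r \in R \text{ and every time } t .
\]
```

The inequalities actually required are more detailed than this sketch, because, as noted above, they depend on the sharing policies and on the granularity of the model.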

3.1.2.  Inexact  knowledge  of  job  resource  usage.  Even 
if  it  is  possible  to  find  a  perfect  solution  to  the 
optimization  problem  that  is  posed  by  instantiating  the 
constraints  and  optimization  criteria  to  the  current 
situation,  the  expected  resource  usage  of  any  given 
application  is  often  only  an  estimate.  In  real-time  systems, 
the  worst  case  estimate  is  often  used  to  assign  resources  to 
processes;  however,  many  other  systems  use  the  mean 
expected  resource  usage.  Our  recent  analysis  has  revealed 
that  using  the  mean  will  cause  the  actual  run-time  to  be 
generally  underestimated  and  that  a  better  assignment  can 
be made if both the mean and the distribution of the expected
resource usage are accounted for, when appropriate [26].
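
One way to see why scheduling with means alone can underestimate run-time is the small worked example below. It is an illustration built on assumed numbers, not the analysis of [26]: when a job finishes only after the slowest of several independent machines, the expected finishing time is at least as large as the finishing time computed from the mean run-times, and usually larger.

```python
from itertools import product


def expected_makespan(machine_outcomes):
    """Exact expected finishing time of the slowest machine.

    machine_outcomes holds, for each machine, a list of
    (runtime, probability) pairs; machines are assumed independent.
    """
    total = 0.0
    for combo in product(*machine_outcomes):
        prob = 1.0
        for _, p in combo:
            prob *= p
        total += prob * max(rt for rt, _ in combo)
    return total


def makespan_from_means(machine_outcomes):
    """Finishing time predicted when each run-time is replaced by its mean."""
    return max(sum(rt * p for rt, p in outcomes) for outcomes in machine_outcomes)


# Two machines, each taking 1 or 3 time units with equal probability.
machines = [[(1.0, 0.5), (3.0, 0.5)], [(1.0, 0.5), (3.0, 0.5)]]
```

For these two machines the mean-based prediction is 2.0 time units, while the exact expectation is 2.5, because three of the four equally likely outcomes include a machine that takes 3 units.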

This  leads  to  another  question  concerning  whether  the 
extra  complexity  involved  in  using  a  sophisticated 
heuristic  will  yield  a  better  schedule  than  using  a  simple 
heuristic  if  the  actual  variance  of  run-times  is  large,  and 
scheduling  is  done  using  the  mean,  or  both  the  mean  and 
the  distribution.  Our  recent  results  in  this  area  have  shown 
that,  in  many  cases,  complex  heuristics  can  determine 
schedules  that,  when  executed,  sometimes  perform  much 
better  than  the  schedules  derived  from  very  simple 
heuristics,  even  when  the  variance  is  large.  However, 
sometimes  very  simple  heuristics  perform  just  as  well  as 
the  more  complex  ones.  The  difference  in  quality  of  the 
schedules  produced  by  the  various  heuristics  was  found  to 
be  closely  correlated  with  the  type  of  heterogeneity  in  a 
system. For example, when both the machine and
application heterogeneity are very low, a simple heuristic
performs just as well as more complex ones. Several
papers  have  described  our  results  concerning  this  research 
[2][3][10][40]. 

3.1.3. Performance of search algorithms. SmartNet's
organization  leveraged  the  idea  of  independence  of  search 
algorithms  and  optimization  criteria.  That  is,  most 
heuristics  for  searching  the  space  of  mappings  can  be 
modified  to  search  for  solutions  to  different  optimizations 
within the same space. For example, Dantzig's Simplex
Method  is  useful  with  all  problems  whose  optimization 
criteria  and  constraint  inequalities  can  be  stated  using  only 
linear  combinations  of  the  variables.  Sometimes,  many 
different  heuristics  will  work,  but,  depending  upon  the 
characteristics  of  a  given  problem,  certain  heuristics  may 
be  preferable  to  others.  For  example,  the  MSHN  team  has 
obtained  extensive  results  identifying  the  regions  of 
heterogeneity  where  certain  heuristics  perform  better  than 
others  for  maximizing  throughput  by  minimizing  the  time 
at  which  the  last  application,  of  a  set  of  applications, 
should complete [2][3][10][40]. Re-targeting of these
heuristics to other optimization criteria is currently
underway.
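
To make the comparison of heuristics concrete, the sketch below contrasts two well-known mapping heuristics from the general scheduling literature: a simple greedy one (minimum completion time, MCT) and a more elaborate one (Min-min). We do not claim these are the specific heuristics evaluated in the cited MSHN studies. The matrix `etc` of estimated times to compute (rows are applications, columns are machines) and its values are made up for illustration.

```python
def mct(etc):
    """Minimum completion time: take tasks in the given order and assign
    each to the machine on which it would finish earliest."""
    ready = [0.0] * len(etc[0])  # time at which each machine becomes free
    for row in etc:
        m = min(range(len(ready)), key=lambda j: ready[j] + row[j])
        ready[m] += row[m]
    return max(ready)  # makespan: completion time of the last task


def min_min(etc):
    """Min-min: repeatedly map the unmapped task whose best achievable
    completion time is smallest, on the machine that achieves it."""
    ready = [0.0] * len(etc[0])
    unmapped = set(range(len(etc)))
    while unmapped:
        finish, task, machine = min(
            (ready[j] + etc[i][j], i, j)
            for i in unmapped for j in range(len(ready)))
        ready[machine] = finish
        unmapped.remove(task)
    return max(ready)


heterogeneous = [[10, 12], [3, 9], [4, 8]]      # hypothetical ETC values
homogeneous = [[5, 5], [5, 5], [5, 5], [5, 5]]  # no heterogeneity at all

print(mct(heterogeneous), min_min(heterogeneous))  # 14.0 12.0
print(mct(homogeneous), min_min(homogeneous))      # 10.0 10.0
```

On this particular heterogeneous instance the more complex heuristic yields the shorter makespan, while on the homogeneous instance both yield the same one. Neither outcome is guaranteed in general; the example only illustrates how the gap can depend on heterogeneity.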

Additionally,  MSHN  team  members  have  performed 
extensive  research  into  accounting  for  dependencies 
between  applications  or  processes  that  make  up  a  single 
application  [40]  [51]  [52]  [54].  This  includes  promising 
results  from  investigating  data  dependencies  and  mapping 
of iterative applications [1][4][5][6][11].

3.2.  Resource  Status  Server  and  Resource 
Requirements  Database  research  issues 

Part of the MSHN team's investigation has been aimed
at  determining  what  information  should  be  stored  in  the 
Resource  Requirements  Database  and  maintained  by  the 
Resource  Status  Server.  First,  a  taxonomy  for  the  types  of 
information  that  could  be  stored  there  was  required.  We 
discuss  this  taxonomy  below.  We  also  discuss  the  impact 
that  viewing  security  as  a  QoS  has  on  these  two  MSHN 
components.  Finally,  one  of  the  most  important  issues  in 
designing  effective  RMSs  is  determining  the  level  of 
granularity  of  information  that  must  be  maintained 
concerning  the  status  of  resources  and  the  requirements  of 
applications.  We  now  discuss  each  of  these  issues  in 
somewhat  more  detail  and  refer  the  interested  reader  to 
relevant  publications. 

3.2.1.  A  taxonomy.  The  MSHN  team  has  formulated  a 
three-part  taxonomy  for  classifying  systems.  The  three 
different  components  include  methods  for  describing  the 
applications,  the  computing  environment,  and  the  mapping 
strategy  that  is  used.  Some  of  the  relevant  characteristics 
that  need  to  be  instantiated  concerning  each  application 
include 

(i)  Its  size,  that  is  the  number  of  tasks  or  sub-tasks 
associated  with  it. 

(ii)  Whether  the  sub-tasks  are  independent  of  one 
another  or,  if  they  are  dependent,  the  types  of 
dependencies. 

(iii)  The  I/O  distributions  of  the  application  and  the 
sources  of  the  I/O,  i.e.,  whether  it  performs  all 
input  in  the  beginning  and  all  output  at  the  end  or 
whether  one  or  the  other  is  performed 
continually  throughout  the  lifetime  of  the 
processes  and  whether  the  input  data  is  obtained 
through  interacting  with  a  person  or  some  other 
source  that  has  highly  variable  response  times. 

(iv)  The  deadlines  and  other  QoS  requirements, 
including  security,  if  any,  associated  with  the 
applications  and/or  the  subtasks  that  comprise 
the  application. 
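
The four application characteristics listed above could be captured in a record such as the following. This is only a sketch of one possible encoding; the class and field names are hypothetical and are not taken from the MSHN taxonomy documents.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class IOPattern(Enum):
    AT_START_AND_END = "all input at the beginning, all output at the end"
    CONTINUOUS = "input and/or output throughout the process lifetime"


@dataclass
class ApplicationProfile:
    name: str
    num_subtasks: int                                   # (i) size
    # (ii) subtask id -> ids of the subtasks it depends on
    dependencies: Dict[int, List[int]] = field(default_factory=dict)
    io_pattern: IOPattern = IOPattern.AT_START_AND_END  # (iii) I/O distribution
    interactive_input: bool = False                     # (iii) variable-latency source
    deadline_s: Optional[float] = None                  # (iv) deadline, if any
    security_requirements: List[str] = field(default_factory=list)  # (iv)

    def subtasks_independent(self) -> bool:
        """True when no subtask lists any prerequisite."""
        return not any(self.dependencies.values())


batch = ApplicationProfile(name="batch-filter", num_subtasks=4)
pipeline = ApplicationProfile(
    name="interactive-pipeline", num_subtasks=3,
    dependencies={1: [0], 2: [1]},
    io_pattern=IOPattern.CONTINUOUS, interactive_input=True,
    deadline_s=0.5, security_requirements=["encrypted-output"])

print(batch.subtasks_independent(), pipeline.subtasks_independent())
```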

Similarly,  the  computing  environments  and  mapping 
strategies  have  numerous,  hierarchically  characterizable, 
attributes  that  are  more  fully  documented  in  other 
publications [9].

3.2.2. Security as a quality of service. Security in the
context  of  QoS  is  a  current  research  area  [34][45].  The 
security  capabilities  of  resources  and  security 
requirements  of  applications  must  influence  the 
assignment  of  applications  to  resources.  We  can  obtain 
information  concerning  the  user  security  requirements 
from  the  Resource  Requirements  Database  and 
information  concerning  the  security  capabilities  of  the 
resource  from  the  Resource  Status  Server.  For  example,  if 
the  output  of  an  application  must  be  encrypted  using  a 
particular  algorithm,  with  a  key  size  chosen  within  a 
particular  range,  then  that  requirement  must  be  stored  in 
the  Resource  Requirements  Database  along  with  the 
amount  of  data  that  must  be  encrypted.  Also,  the 
Resource  Status  Server  must  know  whether  each  particular 
computing  resource  is  capable  of  performing  the  required 
cryptographic  algorithm  and  the  cost,  in  terms  of  run-time 
per  byte,  for  example,  of  encrypting  the  data.  Members  of 
the  MSHN  team  have  developed  an  initial  framework, 
which  they  are  currently  refining,  for  characterizing  the 
overall  security  attributes  of  a  network  and  for 
determining  a  cost  and  benefit  value  for  providing 
required  and  preferred  security  to  an  application 
[26][27][33][33]. 
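
The encryption example above can be sketched as a simple match between a stored requirement and a resource's advertised capability. All names here (`SecurityRequirement`, `ResourceSecurity`, `encryption_cost_us`, the cipher label `cipher-A`, and every number) are hypothetical; the sketch is not the cost and benefit framework cited above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class SecurityRequirement:
    """What the Resource Requirements Database would need to hold."""
    algorithm: str
    min_key_bits: int
    max_key_bits: int
    bytes_to_encrypt: int


@dataclass
class ResourceSecurity:
    """What the Resource Status Server would need to know per resource."""
    name: str
    # algorithm -> (supported key sizes in bits, microseconds per byte)
    ciphers: Dict[str, Tuple[List[int], int]]


def encryption_cost_us(req: SecurityRequirement,
                       res: ResourceSecurity) -> Optional[int]:
    """Estimated encryption time in microseconds, or None if the
    resource cannot satisfy the requirement."""
    entry = res.ciphers.get(req.algorithm)
    if entry is None:
        return None
    key_sizes, us_per_byte = entry
    if not any(req.min_key_bits <= k <= req.max_key_bits for k in key_sizes):
        return None
    return req.bytes_to_encrypt * us_per_byte


req = SecurityRequirement("cipher-A", 128, 256, 1_000_000)
fast = ResourceSecurity("node-1", {"cipher-A": ([128, 192], 2)})
weak = ResourceSecurity("node-2", {"cipher-A": ([56, 64], 1)})
none = ResourceSecurity("node-3", {})

print([encryption_cost_us(req, r) for r in (fast, weak, none)])
```

Only `node-1` both supports the algorithm and offers a key size in the required range, so it is the only candidate a scheduler could consider for this application.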

3.2.3.  Granularity  issues.  Another  very  important 
question  that  concerns  both  the  Resource  Requirements 
Database  and  the  Resource  Status  Server  has  to  do  with 
how  much  detail  should  be  maintained  concerning  the 
status  of  resources  and  the  requirements  of  applications. 
Obviously,  while  a  very  accurate,  detailed  set  of 
information  might  prove  quite  useful  to  the  scheduling 
algorithms, it would be at the least very expensive and
difficult to collect, if not expensive to process within the
algorithm itself.

The  MSHN  team  has  obtained  initial  estimates  for  the 
overhead  of  capturing  system  calls  to  determine  the  cost  of 
collecting  various  granularities  of  such  information  [43]. 
Members  of  the  team  are  currently  using  this  technique  to 
record  fine-grained  information  for  a  program  that 
analyzes  air  tasking  orders  and  will  report  both  the 
information  concerning  the  resources  that  were  used,  as 
well  as  the  overhead  involved  in  collecting  the  resource 
usage  information  [40]. 

In addition to the cost associated with collecting
fine-grained information concerning applications’ use of
resources, there is the question of how much information
is sufficient. Current experiments of the MSHN team
focus on determining whether fairly simple models can be
used to predict the relative performance of
application/resource assignments. To perform realistic
experiments, the team has built an initial application
emulator (see below) and is actually executing it with
different parameters on different systems, using all
possible configurations to compare the actual received
QoS to the predicted QoS. Thus far we have determined
that the Resource Status Server must, directly or
indirectly, contain information concerning whether native
threads are supported by the operating system. If this
information is not maintained, the scheduling algorithm,
which must choose between two platforms that are
identical except for the operating system version that they
execute, may assign a process which could be handled
better by one platform to the other. Similarly, the
Resource Requirements Database must indicate whether or
not the application is multi-threaded and the number and
nature of threads that it uses. Information concerning
these results can be found in other publications [11].

3.3. Application Emulator research issues

The  MSHN  team  is  designing  and  implementing  an 
application  emulator  for  two  different  reasons.  One  reason 
is  that  it  is  needed  within  the  MSHN  architecture  to 
monitor  the  end-to-end  status  of  the  resources.  The  other 
reason  is  to  be  able  to  easily  construct  a  very  large  suite  of 
application  emulators  that  place  loads  on  resources  in  the 
same  way  that  the  actual  applications  would.  When  used 
in  conjunction  with  resource  usage  measurements  from 
linking actual applications to MSHN's Client Library, the
MSHN  Application  Emulator  can  be  used  to  emulate  the 
execution  of  the  actual  applications  without  requiring  the 
applications  to  actually  be  ported  to  many  different 
platforms.  The  obvious  advantage  of  using  such  an 
application  emulator,  rather  than  porting  the  applications 
themselves,  is  to  enable  the  MSHN  researchers  to  test  their 
architecture  more  quickly  under  many  different  situations. 

To  meet  the  first  purpose  of  the  MSHN  Application 
Emulator,  we  first  had  to  define  the  meaning  of  loading 
resources  for  various  resources.  Percentages  cannot  be 
used,  as  they  are  not  transferable  between  either 
computing  platforms  or  network  media.  Rather,  each 
category  of  resource  was  identified  and  units  that  can  be 
most  easily  translated  between  different  platforms,  such  as 
FLOPS  and  bytes/sec,  were  chosen  to  quantify  resource 
use.  Also  recognized  at  this  stage  was  the  need  to  have 
both  multi-threaded  and  non-multi-threaded  application 
emulator  capability.  Finally,  not  only  can  a  single 
application  be  comprised  of  multiple  threads,  but  it  can 
also be comprised of multiple heavy-weight processes.
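
The reason absolute units transfer between platforms while percentages do not can be shown with a small sketch. The `LoadSpec` and `Platform` records, the assumption that computation and communication do not overlap, and all numbers are ours, chosen for illustration; they do not describe the actual MSHN Application Emulator.

```python
from dataclasses import dataclass


@dataclass
class LoadSpec:
    compute_flop: float    # floating-point operations to perform
    transfer_bytes: float  # bytes to move over the network


@dataclass
class Platform:
    name: str
    flops: float           # sustained floating-point operations per second
    bytes_per_sec: float   # sustained network throughput


def estimated_seconds(load: LoadSpec, platform: Platform) -> float:
    """Time to place the load, assuming the compute and transfer
    phases run one after the other."""
    return (load.compute_flop / platform.flops
            + load.transfer_bytes / platform.bytes_per_sec)


load = LoadSpec(compute_flop=4e9, transfer_bytes=1e8)
fast_cpu = Platform("fast-cpu-slow-net", flops=2e9, bytes_per_sec=1e7)
fast_net = Platform("slow-cpu-fast-net", flops=1e9, bytes_per_sec=1e8)

print(estimated_seconds(load, fast_cpu))  # 2 s compute + 10 s transfer
print(estimated_seconds(load, fast_net))  # 4 s compute + 1 s transfer
```

The same load specification yields different durations on the two platforms, which is exactly the information a statement such as "50% CPU utilization" would not carry across machines.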

When  designing  the  Application  Emulator  to  meet  both 
of  its  requirements,  we  recognized  that  distributions 
reflecting  communication  and  computation  alone  were 
insufficient;  conditional  probabilities  were  required.  That 
is,  many  times  the  purpose  of  one  process  sending  a 
message  to  another  process  is  so  that  the  receiving  process 
will  perform  work  on  behalf  of  the  sending  process. 
Therefore,  we  designed  our  most  general  emulator  to  also 
have the capability of sending work-bearing messages.

To this end, we have completed an initial
implementation  of  an  application  emulator  that  we  have 
used  for  our  granularity  research  and  are  testing  the  more 
general  application  emulator.  Documentation  concerning 
both  of  these  application  emulators  can  be  found 
elsewhere  [11][15]. 

3.4.  Client  Library  research  issues 

The  research  issues  having  to  do  with  the  Client 
Library  component  involve  both  mechanism  and  policy. 
The  mechanism  issues  have  to  do  with  how  to 
transparently  link  the  Client  Library  with  applications. 
Previous  research  in  the  areas  of  process  migration  and 
tools  for  debugging  parallel  and  distributed  programs 
provide  us  with  easy  solutions,  as  mentioned  earlier. 
Therefore,  the  only  issue  that  remains  is  how  best  to 
transparently  determine  the  end-to-end  availability  of 
resources.  First,  simply  determining  that  the  Client 
Library  could  perform  this  functionality  better  than 
providing  the  functionality  external  to  the  applications 
themselves  is  an  important  contribution.  However, 
determining  the  average  end-to-end  availability  of  a 
network  resource  is  not  a  trivial  problem.  The  MSHN 
team's initial progress in this area has already been
detailed  elsewhere  [43]. 

4.  Summary  and  future  work 

In  this  paper  we  summarized  the  purpose  of  a  resource 
management  system  (RMS)  in  general  and  the  research 
goals  of  one  particular  experimental  RMS,  the 
Management  System  for  Heterogeneous  Networks 
(MSHN).  Motivation  was  provided  for  all  of  the  major 
components  of  MSHN,  and  the  architecture  that  contains 
those  components  was  explained.  Some  of  the  research 
questions  that  the  MSHN  researchers  are  seeking  answers 
to  were  described.  References  were  provided  that  enable 
the  reader  to  better  understand  MSHN,  and  to  learn  more 
about  the  MSHN  experiments.  There  are  many  other 
interesting  RMS  research  projects  in  progress  today,  but 
space  permitted  us  to  survey  only  a  few  of  them.  In 
addition  to  continuing  the  on-going  experiments  described 
in the paper, future MSHN investigation will focus on (i)
reaching  a  better  understanding  of  the  level  of  granularity 
obtainable  from  applications  and  the  level  required  to 
perform  sufficiently  good  resource  assignment;  (ii)  more 
detailed  characterization  of  security  costing  and  metrics; 
and  (iii)  determining  the  best  search  algorithms  to  use  for 
the MSHN optimization criteria under various conditions.

Abstract

This project explores the development of a hardware/software
infrastructure to enable the provision
of quality of service (QoS) guarantees in high performance
networks used to configure clusters of workstations/PCs.
These networks of workstations (NOWs)
have emerged as a viable high performance computational
vehicle and are also being called upon to support
access to multimedia datasets. Example applications
include Web servers, video-on-demand servers, immersive
environments, virtual meetings, multi-player
3-D games, interactive simulations, and collaborative
design environments. Such applications must often
share the interconnect with traditional compute intensive
parallel/distributed applications that are usually
driven by latency requirements in contrast to jitter,
loss rate, or throughput requirements. The challenge
is to develop a communication infrastructure that effectively
manages the network resources to enable the
diverse QoS requirements to be met. The major components
of QUIC include (1) use of powerful processors
embedded in the network interfaces, (2) scheduling
paradigms for concurrently satisfying distinct QoS requirements
over multiple streams, (3) re-configurable
hardware support to enable complex scheduling decisions
to be made in the desired time frames, and (4)
a flexible and extensible virtual communication machine
that provides a uniform interface for dynamically
adding hardware/software functionality to the network
interfaces (NIs). This paper reviews the goals, approach
and current status of this project.

* This work is supported in part by DARPA through the Honeywell
Technology Center under contract numbers B09332478
and B09333218, the British Engineering and Physical Sciences
Research Council with grant number 92600699, Intel Corporation,
and the WindRiver Systems University Program.
† Now with IBM T. J. Watson Research Laboratories

1 Introduction

The continuing rapid decrease in the cost of both processor
and network components has led to a tighter
integration of computation and communication. The
result has been an explosion of network-based applications
characterized by the processing and delivery of
continuous data streams and dynamic media [1, 6] in
addition to servicing static data. Numerous examples
can be drawn from web-based applications, interactive
simulations, gaming, visualization, and collaborative
design environments. The changing nature of the workload
and cost/performance tradeoffs has prompted the
development of a new generation of scalable media
servers structured around networks of workstations
interconnected by high speed system area network (SAN)
fabrics. However, the ability to construct scalable clusters
that can serve both static and dynamic media is
predicated on successfully addressing two major issues.

The first is that node architectures are based on
a CPU-centric model optimized for uniprocessor or
small-scale multiprocessor applications. This can lead
to significant inefficiencies for distributed applications.
Specifically, while CPU and wire bandwidths have been
increasing rapidly over the years, memory and intra-node
I/O bandwidths continue to improve at much
slower rates, resulting in a performance gap that will
continue to widen in the foreseeable future. This implies
that interactions between the network and hosts
utilizing main memory are expensive. Additional costs
arise for such interactions from overheads due to I/O
bus usage [4, 5], communication protocol implementations
(e.g., if interrupts are used [2]), and interactions
with the host CPU’s memory management and caching
infrastructure [3]. Consequently, network-based applications
that produce, transport, and process large data
sets suffer substantial losses in performance when these
data sets must be moved through the memory and I/O
hierarchies of multiple nodes.

The second issue is the workload and performance
characteristics of this new generation of network-based
applications. Data types, processing requirements and
performance metrics have changed, placing new functional
demands on the systems that serve them. Several
key attributes are as follows [1, 6].

1. Real-Time Response: Interactivity and requirements
on predictability, such as when to service
video streams, make real-time response important.
Such timing constraints cannot be met
unless the network resources can be scheduled and
allocated effectively.

2. Shift from Quantitative to Qualitative Metrics:
With the new applications there has also
been a shift in metrics that define their performance
[1, 7]. Traditional quantitative metrics
such as latency and bandwidth give way to more
qualitative metrics such as jitter and real-time response.
The smooth update of a video stream or
the response time to a user access request is more
important than minimizing transmission latency.
The metrics clearly affect the choice of implementation
techniques. For example, packets may be
dropped in a video or audio stream without compromising
quality, whereas this would be unacceptable
for most transaction processing applications.

3. Hardware Limitations: The service and transfer
of video streams and images place increasing
demands on the memory and I/O bandwidths at
a time when the bandwidth gap between the CPU
and memory and I/O subsystems is growing. Furthermore,
the available physical bandwidth to the
desktop and to the home will grow by several orders
of magnitude. Wire bandwidths into the cluster
that serves these machines with media will follow
or lead this trend. This will exacerbate the
“bandwidth gap” between the CPU and the wire,
thereby making qualitative metrics more sensitive
to the same.

4. Heterogeneity: Systems such as real-time media
servers need to service hundreds, and possibly
thousands, of clients, typically each with their
own quality of service (QoS) requirements,
such as packet dropping vs. reliable message
transmission, low latency vs. jitter, or throughput
vs. latency. We must concurrently meet diverse
service requirements with the same set of hardware
and software resources.
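
The distinction drawn in item 2 between throughput-style metrics and jitter can be illustrated with a small sketch. Jitter has several definitions in the literature; here we assume, purely for illustration, that it is the spread (largest minus smallest) of the gaps between successive frame arrivals. The function names and arrival times are hypothetical.

```python
def inter_arrival_gaps(arrivals_ms):
    """Gaps between successive arrival timestamps (milliseconds)."""
    return [later - earlier
            for earlier, later in zip(arrivals_ms, arrivals_ms[1:])]


def jitter_ms(arrivals_ms):
    """Spread of the inter-arrival gaps (an assumed, simple definition)."""
    gaps = inter_arrival_gaps(arrivals_ms)
    return max(gaps) - min(gaps)


# Both streams deliver four frames within 99 ms, so their average
# throughput is identical, yet only the first plays back smoothly.
smooth = [0, 33, 66, 99]
bursty = [0, 10, 80, 99]

print(jitter_ms(smooth), jitter_ms(bursty))  # 0 60
```

A scheduler that optimizes only throughput cannot tell these two streams apart, which is why per-stream QoS requirements have to be stated explicitly.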

The QUIC project studies the issues inherent in
Quality of Service management for cluster machines.
Our project focuses on the functionality of the network
messaging layer in providing QoS guarantees. Our approach
is based on the development of an extensible
QoS management infrastructure for which we carefully
select components that are to be implemented within
programmable network interfaces (NIs). In Section 2
we provide a brief overview of the project, while a description
of the overall Quality Management infrastructure
can be found in Section 3. The rest of the paper
describes the approach taken in the implementation of
each of the key QUIC components.

2  Project  Overview  and  Goals 


Figure 1: An Overview of the QUIC Development Infrastructure
(panels: Cluster Environment, Computing Node).


This project has a strong experimental component and
therefore our infrastructure is biased towards rapid
prototyping and evaluation. An overview of the QUIC
infrastructure is illustrated in Figure 1. Our development
environment is a cluster of 16 quad-Pentium
Pro nodes interconnected via Intel's i960 based Intelligent
I/O (I2O) network interface cards running
the VxWorks operating system and interconnected via
100 Mbits/sec Ethernet. Concurrent development proceeds
under the Windows NT and Solaris node operating
system environments. Individual nodes have multiple
network interfaces, multiple CPUs and eventually
will coordinate services from multiple nodes [13].

Our goal is the development of an effective quality
management infrastructure that can service a large
number of connections, each with distinct service requirements,
while minimizing host-NI interactions for
NIs whose functionality can be easily and dynamically
extended. Our approach also investigates the extension
of NI functionality to include computations on data
streams as they pass through the network interface.
By performing such stream computations ‘on the fly’ as
data is passed through the interface we can avoid costly
traversals of the memory hierarchy and thereby obtain
finer control over service quality, for example real-time
response. We are investigating supporting such
stream computations through the use of programmable
hardware in the form of dynamically configurable field
programmable gate arrays (FPGAs) within the NIs.
By placing quality management functionality “close”
to the wire and I/O components attached to the I2O
boards, we expect to enable QoS guarantees at higher
levels of resource utilization than commodity clusters
will otherwise permit. The following sections describe
the individual aspects of the QUIC project.


3  QUIC  Quality  Management 
Infrastructure 

The  QUIC  QoS  infrastructure  has  several  components 
that  jointly  permit  the  implementation  of  a  variety  of 
quality  management  functions  and  policies  applied  to 
the  information  streams  into  and  out  of  CPUs. 

•  At the core of QUIC resides the extensible NI software
architecture, which is designed for runtime
extension with functions that manage the information
streams used by particular applications.
Two types of functions are supported: (1) streamlets
that operate on the contents of data being
streamed out of or into hosts, via the NIs on which
the QUIC infrastructure’s core components reside,
and (2) scheduling or quality management functions
that manage such streams, typically by applying
certain scheduling algorithms to streams as
they pass through NIs and the NI-host interface.
This paper’s principal focus is on these scheduling
functions and on the manner in which they are
applied to streams.

•  QUIC offers two types of interfaces to applications:
(1) a communication interface that directly
supports its information streams, using standard
communication protocols enhanced with the ability
to specify scheduling functions or streamlets
applied to them, and (2) an extension interface
using which applications can define new quality
management (scheduling) functions or streamlets
to be applied to their information streams.

Figure 2: Architecture of the I2O Network Interface.


•  QUIC  also  offers  functions  within  each  NI  that 
permit  the  coordination  among  multiple  NIs  that 
jointly  operate  on  certain  information  streams  or 
cooperate  to  support  applications.  This  ‘control 
layer’  of  QUIC  is  also  described  below. 

These  functional  components  of  QUIC  are  reviewed  in 
the  following  sections. 

3.1  Network  Interface  Architecture 

Hardware Platform. Intel’s IQ80960RD66 Evaluation
Platform Board serves as the underlying hardware
upon which the QUIC NI software architecture is being
instantiated. The board is designed to be a testbed
for systems that offload I/O processing from the host.
The architecture of the network interface is shown in
Figure 2. The board resides in a PCI slot on the host
machine and provides two network ports and two SCSI
ports on an isolated PCI bus. The card's single processing
unit is an Intel i960RX processor running at
66 MHz, providing sufficient compute power to experiment
with the movement of non-trivial computations
to the interface [14, 15].

The quality management infrastructure executing
within this NI will be layered on VxWorks, a real-time
operating system from Wind River Systems. Our
motivation for using VxWorks is its provision of cross
compilation tools as well as a runtime layer using which
our ideas concerning suitable NI functionality for quality
management may be prototyped rapidly.

In general, the criteria for choice of an embedded
kernel can be delineated very simply as follows: small
footprint, lightweight, configurability, support for devices
on the I/O controller card, and cross platform
development and debugging facilities. When choosing
an NI for this project, the first option we considered
was to construct a custom kernel for the i960. Our
desire to rapidly prototype a functional system led to
the choice of a commercial kernel for the i960 RD card.
Two viable offerings from Wind River Systems are the
IxWorks system developed as part of the I2O industry
initiative and the VxWorks kernel. The ROM-resident
IxWorks system provides a flexible environment
for I2O device driver development and for monitoring
driver performance, and it offers sophisticated message
passing and event queues between multiple IOP
boards. However, it assumes that message transport
is performed using the I2O standard. We decided not
to pursue this approach due to our focus on high performance
(including low latency) communications. Toward
this end, we wish to experiment with alternative
I2O-host communication interfaces. Such experimentation
is enabled by the second commercial offering
from Wind River Systems for the i960 RD environment:
the VxWorks development environment. The
VxWorks operating system kernel is based on the same
microkernel as the IxWorks I2O system. In contrast to
IxWorks, VxWorks is highly configurable, as it can be
scaled from a small footprint ROMable version to a
larger footprint, full-featured operating system. VxWorks
makes no specific assumptions about the I2O-host
interface [20, 19].

QUIC Communication Paradigm. QUIC utilizes
a general model of processing stream data closer to the
network interface wherein the nature of the processing
may include simple data computations as well as
scheduling computations. Towards this end, QUIC enables
the dynamic placement of computations within
the NI. A request for the establishment of a connection
will identify the required computations and specify
desired levels of service quality based on which additional
functionality such as admission control, policing
(for network bandwidth) and scheduling may be
performed. At this point, we simply point out that
the establishment of a connection concerns not just
the allocation of communication bandwidth to enable
real-time transmission, but also the identification of
the computations relevant to the data stream and the
allocation of appropriate computational resources so
that the data may be operated on in real-time. The
two classes of computations considered in our work are
(1) computations that operate on the actual stream
data and (2) those that concern the runtime control
(i.e., scheduling) for streams. In either case, concerning
communication coprocessors, this implies the runtime
extension of these coprocessors with functionality
suited for specific streams. An architecture and suitable
interfaces for this purpose are being developed and
are described as follows.
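
Before turning to that architecture, the connection-establishment step described above can be sketched as a simple admission test. This is a minimal illustration under our own assumptions: the class names, the computation labels, and the rule that only link bandwidth and the presence of the requested computations are checked are all hypothetical, and a real admission controller would also have to budget NI processor time.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ConnectionRequest:
    stream_id: str
    bandwidth_mbps: float
    computations: List[str]  # computations to be placed in the NI


@dataclass
class NetworkInterface:
    link_mbps: float
    available_computations: Set[str]
    admitted: List[ConnectionRequest] = field(default_factory=list)

    def admit(self, req: ConnectionRequest) -> bool:
        """Admit the connection only if bandwidth remains and every
        requested computation can be placed on this NI."""
        reserved = sum(r.bandwidth_mbps for r in self.admitted)
        if reserved + req.bandwidth_mbps > self.link_mbps:
            return False
        if not set(req.computations) <= self.available_computations:
            return False
        self.admitted.append(req)
        return True


ni = NetworkInterface(link_mbps=100.0,
                      available_computations={"checksum", "scale"})
requests = [
    ConnectionRequest("video-1", 60.0, ["scale"]),
    ConnectionRequest("bulk-1", 50.0, []),            # exceeds the link
    ConnectionRequest("video-2", 30.0, ["transcode"]),  # unknown computation
    ConnectionRequest("audio-1", 40.0, ["checksum"]),
]
results = [ni.admit(r) for r in requests]
print(results)  # [True, False, False, True]
```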

Virtual Communication Machine. To application
programs, the NI is abstracted as a virtual communication
machine (VCM), where its visible interface to applications
is one that (1) defines the computations (instructions)
it is able to perform on the applications’ behalf
and that (2) offers functions for the machine’s runtime
extension to add or subtract instructions as well
as reporting necessary elements of its internal state.

Internally,  the  VCM’s  software  architecture  aims  to 
provide  adequate  support  for  using  the  NI  processor  to 
improve  the  performance  of  streams  used  by  network 
applications.  Towards  this  end,  the  VCM  provides  an 
efficient  environment  for  executing  application-specific 
computational  modules  that  can  benefit  from  running 
‘closer  to  the  network’.  We  use  the  term  application 
specific  extension  modules  (ASEM)  to  refer  to  such 
application-specific  code  that  is  ‘directly’  using  the  NI 
resources.  In  our  testbed,  the  VCM  will  execute  on  the 
i960-based I2O coprocessor to which multiple disks and
network  links  are  attached.  ASEMs  provide  a  common 
abstraction  and  can  be  dynamically  placed  in  the  NI 
to  process  streams,  perform  scheduling,  or  manage  NI 
resources  such  as  disks. 

Host  applications  communicate  with  the  VCM 
through  shared  (host  or  NI)  memory.  The  union  of 
all  these  memory  regions  is  called  the  VCM  address 
space.  Applications  issue  execution  requests  to  the  NI 
as  VCM  tasks.  Tasks  can  be  created  and  destroyed 
dynamically.  The  first  VCM  task  of  an  application  is 
created  automatically  when  the  application  connects 
to  the  NI.  The  most  common  example  of  a  task  is  one 
that executes a VCM program, which in turn is
composed of core and ‘added’ instructions.

VCM  programs  are  built  as  sequences  of  core  and/or 
‘added’  VCM  instructions.  Core  instructions  imple¬ 
ment  VCM  functionality  always  resident  on  the  VCM, 
including  the  functionality  shared  by  all  extensions 
and  that  required  to  configure  the  VCM’s  operation. 
Added  instructions  implement  the  extensions  used  by 
certain  application  programs.  In  other  words,  added 
instructions  are  functions  specific  to  certain  applica¬ 
tions. 

The  VCM  instruction  dispatcher  is  implemented  as 
an  interpreter  running  on  the  NI  processor.  The  dis¬ 
patcher  reads  each  VCM  instruction,  checks  the  avail¬ 
ability  of  its  parameters  and  activates  the  appropriate 
code.  The  NI  processor  transfers  back  results  to  the 
application  by  writing  into  the  memory  regions  shared 
between the NI and the application.

The overhead of the interaction between applications
and VCM instructions is low because of the
shared-memory-based  implementation.  To  further 
lower  this  overhead,  access  patterns  to  the  shared  re¬ 
gions  are  used  to  determine  their  placement  in  the  host 
or the NI memory. Using shared memory also helps in
decoupling execution on the host from execution on the
NI. Towards this end, applications and the interface
layer can build in the VCM address space arbitrarily
long chains of ‘tasks-to-be-executed’ and
‘tasks-completed’ descriptors,
respectively. In addition to the shared-memory-based
interaction,  the  interface  layer  can  also  signal  to  ap¬ 
plication  processes  using  signaling  primitives  provided 
in  the  host  operating  system.  We  note  that  certain 
elements  in  the  proposed  architecture  were  influenced 
by  our  experience  with  a  small  prototype  built  around 
an OC-3 ATM card (FORE SBA-200E). This commercial
NI features a 25 MHz i960CA processor and 256K of
SRAM. Some details of our design appear next.

Extension Functionality and Interface. We now
comment on the ‘extension’ instructions that are part
of the core instruction set. The conceptual bases for
this work are (1) our previous experience with the
implementation of a VCM for FORE ATM interface boards [4]
and (2) event-based mechanisms developed by our group
for uniprocessor and distributed systems. The basic
idea of these mechanisms is to permit applications to
define events of certain types, to associate (at runtime)
handlers  for  these  events,  and  to  create  event  channels 
to  which  event  producers  and  consumers  can  subscribe. 
In  our  system,  handlers  are  executed  anytime  an  event 
is  produced  or  consumed,  at  the  producing  and  at  the 
consuming  side  of  the  event  channel.  For  online  VCM 
extension,  then,  the  application  may  produce  an  ex¬ 
tension  event  and  provide  a  new  handler  for  this  event 
type.  The  handler  code  is  installed  at  runtime  on  the 
VCM,  resulting  in  the  creation  of  a  new  VCM  instruc¬ 
tion  ready  for  use  by  the  application  program.  Inter¬ 
actions  of  the  new  VCM  instruction  with  lower  level 
VCM  facilities  are  resolved  at  installation  time,  as  well. 

The  advantage  of  using  this  event-based  approach 
is our ability to have any number of VCMs listen for
extension  events  from  any  number  of  application  pro¬ 
grams,  thereby  offering  a  scalable  approach  to  system 
extension  even  for  large-scale  machines.  Some  imple¬ 
mentation  details  on  VCM  extension  follow. 

Each  extension  module  is  assigned  one  or  more  VCM 
instruction  opcodes  and  control  message  IDs.  An  ex¬ 
tension  module  is  a  collection  of  the  following  types  of 
handlers: 

• VCM instruction handlers, invoked by the VCM
instruction dispatcher upon encountering an
instruction assigned to the module,

• control message handlers, invoked upon the delivery
of a control message with an ID assigned to
the module, and

•  time-out  handlers. 

These  handlers  share  the  state  of  the  extension  mod¬ 
ules,  which  is  stored  in  the  NI  memory.  We  imple¬ 
mented  five  extension  modules  in  the  ATM  prototype 
VCM. They all share the structure outlined above and
range in size from 120 (the simplest) to 1170 (the most
complex) lines of C code.
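
The module structure described above can be illustrated with a small sketch. It is written in Python for brevity rather than in the C used for the prototype, and every name in it (CounterModule, ToyVCM, the COUNT opcode, the RESET message ID) is a hypothetical stand-in: the paper does not give the actual opcodes, handler signatures, or memory layout. The sketch only shows the idea that one module owns some opcodes and control message IDs, and that its three kinds of handlers share one piece of module state.

```python
class CounterModule:
    """Hypothetical extension module that counts the bytes it has seen."""

    opcodes = ("COUNT",)        # VCM instruction opcodes assigned to the module
    message_ids = ("RESET",)    # control message IDs assigned to the module

    def __init__(self):
        # State shared by all handlers (kept in NI memory in the paper).
        self.state = {"bytes": 0, "timeouts": 0}

    def on_instruction(self, opcode, params):
        """VCM instruction handler."""
        self.state["bytes"] += len(params[0])
        return self.state["bytes"]

    def on_control_message(self, msg_id, payload):
        """Control message handler."""
        self.state["bytes"] = 0

    def on_timeout(self):
        """Time-out handler."""
        self.state["timeouts"] += 1


class ToyVCM:
    """Routes instructions and control messages to the module that owns them."""

    def __init__(self):
        self.by_opcode = {}
        self.by_message = {}
        self.results = []       # stands in for the shared-memory result region

    def install(self, module):
        """Runtime extension: register the module's opcodes and message IDs."""
        for opcode in module.opcodes:
            self.by_opcode[opcode] = module
        for msg_id in module.message_ids:
            self.by_message[msg_id] = module

    def dispatch(self, program):
        """Interpret a program given as a list of (opcode, params) pairs."""
        for opcode, params in program:
            module = self.by_opcode.get(opcode)
            if module is None:
                self.results.append((opcode, "unknown instruction"))
            else:
                result = module.on_instruction(opcode, params)
                self.results.append((opcode, result))

    def deliver(self, msg_id, payload=b""):
        """Hand a control message to the module that owns its ID."""
        self.by_message[msg_id].on_control_message(msg_id, payload)
```

Installing a module before dispatching a program corresponds to the runtime creation of new VCM instructions; an opcode that no installed module owns is reported rather than executed.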

3.2  QUIC  Runtime  Environment 

The  QUIC  runtime  environment  (RTE)  is  implemented 
as  the  VCM’s  runtime  layer.  This  layer  has  two 
components,  the  first  of  which  provides  a  set  of 
technology-independent  low-level  communication  ab¬ 
stractions.  These  abstractions  make  writing  extension 
modules  easier,  as  the  application  programmer  does 
not  have  to  be  aware  of  the  hardware  details  of  a  par¬ 
ticular  NI  card.  In  addition,  these  abstractions  are 
independent  of  the  underlying  networking  technology 
(Ethernet,  ATM,  or  Myrinet),  thereby  making  the  ex¬ 
tension  modules  portable  at  the  source  code  level.  On 
top  of  this  layer,  the  second  component  of  the  RTE  is 
comprised  of  communication  abstractions  that  support 
a  collection  of  core  services  needed  by  most  extension 
modules.  By  providing  these  services  as  part  of  the 
RTE,  most  of  the  functionality  overlap  between  exten¬ 
sion  modules  is  eliminated.  The  implementation  of  this 
RTE  component  takes  advantage  of  all  the  hardware 
support  available  on  the  NI  to  provide  the  best  perfor¬ 
mance  and,  consequently,  is  highly  dependent  on  the 
NI  hardware. 

Principles Guiding RTE Design. The NI-dependent
implementation of the first RTE component
depends  on  the  speed  of  the  NI  processor,  the  capacity 
of  the  NI  memory,  and  the  overheads  of  driving  the 
network  interconnect  at  full  speed.  A  single-threaded 
library  implementation  of  the  RTE  is  the  right  choice 
for  NIs  with  limited  resources.  An  implementation 
based  on  a  kernel  for  an  embedded  operating  system 
should  only  be  used  for  NIs  with  sufficient  resources: 
fast  processors,  large  memories,  and  interconnect- 
specific  hardware  that  support  the  NI  processor  in 
driving  the  attached  interconnects.  Between  these 
two  alternatives  sits  the  RTE  implementation  as  a 
multithreaded

RTE Components. The RTE comprises several
components. The key criterion for including a component
in the RTE is that it is used by several different
extension modules. Our current design includes the
following components.

Control  Messaging  System:  This  component  imple¬ 
ments  reliable  and  in-order  delivery  of  short  messages 
between  the  NI  processors.  These  short  messages  are 
called  control  messages  and  are  used  by  extension  mod¬ 
ules  on  different  NIs  to  exchange  information.  We 
found  both  reliability  and  in-order  delivery  very  con¬ 
venient  when  writing  extension  modules.  Reliability 
needs  to  be  implemented  in  the  NIs  as  most  SANs  do 
not guarantee 100% message delivery. As message loss
rates  and  latencies  are  relatively  low  on  SANs,  imple¬ 
menting  in-order  delivery  should  not  increase  the  over¬ 
head  of  the  NI  processor  considerably.  Ideally,  the  la¬ 
tency  of  control  messages  should  be  only  slightly  higher 
than  the  hardware-imposed  limit.  To  achieve  this  per¬ 
formance  goal,  buffers  for  control  messages  are  pre¬ 
allocated  in  the  NI  memory.  In  our  current  prototype, 
reliable  delivery  is  based  on  a  sliding  window  proto¬ 
col.  Block  acknowledgments  are  used  between  interface 
processors  to  acknowledge  the  receipt  of  multiple  mes¬ 
sages  although  upon  request,  certain  control  messages 
can  be  acknowledged  immediately.  This  latter  feature 
is  useful  in  building  fault-tolerant  applications.  For  in¬ 
stance,  we  implemented  this  feature  in  our  prototype 
and used it in a remote write extension module. This
extension  module  is  designed  to  improve  the  perfor¬ 
mance  of  applications  that  achieve  fault-tolerance  by 
maintaining  a  copy  of  their  state  in  the  memory  of  a 
remote  host. 
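
A minimal sketch of reliable, in-order delivery with a sliding window and block acknowledgments follows. It is an illustration only: the class names, the window size, and the choice of a receiver that accepts only the next expected sequence number are assumptions, not details taken from the prototype.

```python
from collections import OrderedDict


class WindowedSender:
    """Sender side: at most `window` control messages may be unacknowledged."""

    def __init__(self, window):
        self.window = window
        self.next_seq = 0
        self.unacked = OrderedDict()    # seq -> payload, oldest first

    def can_send(self):
        return len(self.unacked) < self.window

    def send(self, payload):
        if not self.can_send():
            raise RuntimeError("window full")
        seq = self.next_seq
        self.unacked[seq] = payload     # kept for possible retransmission
        self.next_seq += 1
        return seq

    def block_ack(self, upto):
        """One acknowledgment releases every message with seq <= upto."""
        for seq in [s for s in self.unacked if s <= upto]:
            del self.unacked[seq]

    def pending(self):
        """Sequence numbers that would be retransmitted on a time-out."""
        return list(self.unacked)


class InOrderReceiver:
    """Receiver side: delivers messages strictly in sequence order."""

    def __init__(self):
        self.expected = 0
        self.delivered = []

    def receive(self, seq, payload):
        """Accept only the next expected message.

        Returns the highest sequence number delivered in order so far
        (-1 if none), which can be sent back as a block acknowledgment.
        """
        if seq == self.expected:
            self.delivered.append(payload)
            self.expected += 1
        return self.expected - 1
```

An immediate acknowledgment, as requested by the remote write module, would correspond to sending the value returned by `receive` right away instead of batching it.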

Time-out Components: A second RTE component
implements several time-out primitives with different
granularities and precision. An extended set of
time-out primitives is necessary because we expect higher
precision and/or finer granularity to imply higher NI
processor overhead.

Message  Management  Components:  Routines  for  ef¬ 
ficient  assembling/disassembling  of  large  messages 
from/into  arbitrary  collections  of  memory  segments, 
placed  either  in  the  host  or  NI  memory,  are  another 
RTE  service.  Our  ATM-based  prototype  includes  rou¬ 
tines  implementing  zero-copy  messaging:  outgoing  or 
incoming  data  is  moved  between  the  network  registers 
and  final  destination  without  any  intermediate  copies 
in  the  host  or  NI  memory.  Our  remote  memory  ac¬ 
cess  and  bulk  messaging  extension  modules  use  these 
routines. 

Memory  Management  Component:  The  RTE  includes 
a dynamic memory management system. Our prototype
includes a heap module, which is implemented as a
buddy system to achieve better predictability.
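
For readers unfamiliar with buddy allocation, the following sketch shows the general technique: block sizes are powers of two, a free block is split in half until it fits the request, and a released block is merged with its "buddy" whenever that buddy is also free. The class name, the minimum block size, and the use of Python sets are illustrative assumptions and say nothing about how the prototype's heap module is coded.

```python
class BuddyHeap:
    """Toy buddy allocator over a heap of 2**total_order bytes."""

    def __init__(self, total_order, min_order=4):
        self.min_order = min_order
        self.max_order = total_order
        # One set of free block offsets per block order (size 2**order).
        self.free = {k: set() for k in range(min_order, total_order + 1)}
        self.free[total_order].add(0)   # initially one block spans the heap
        self.allocated = {}             # offset -> order

    def _order_for(self, size):
        order = self.min_order
        while (1 << order) < size:
            order += 1
        return order

    def alloc(self, size):
        """Return the offset of a block of at least `size` bytes."""
        order = self._order_for(size)
        k = order
        while k <= self.max_order and not self.free[k]:
            k += 1
        if k > self.max_order:
            raise MemoryError("no block large enough")
        offset = min(self.free[k])
        self.free[k].remove(offset)
        while k > order:                # split, keeping the lower half
            k -= 1
            self.free[k].add(offset + (1 << k))
        self.allocated[offset] = order
        return offset

    def release(self, offset):
        """Free a block and merge it with its buddy while possible."""
        order = self.allocated.pop(offset)
        while order < self.max_order:
            buddy = offset ^ (1 << order)
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            offset = min(offset, buddy)
            order += 1
        self.free[order].add(offset)
```

The number of split and merge steps is bounded by the number of block orders, which is one reason a buddy system is often chosen when predictable allocation cost matters.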

4  QUIC  Quality  of  Service 
Management 

The  QUIC  scheduler  represents  an  approach  that  em¬ 
phasizes  dynamic  adaptation  of  scheduler  parameters. 
We  are  pursuing  an  approach  wherein  existing  schedul¬ 
ing  algorithms  can  be  modeled  while  permitting  the 
algorithms  to  be  adapted  over  time  in  an  application- 
specific  manner  to  respond  to  varying  QoS  needs. 

4.1  QoS  Management  Paradigm 

The  QUIC  project  is  exploring  a  flexible  scheduling 
paradigm  to  concurrently  satisfy  diverse  QoS  require¬ 
ments  across  multiple  data  streams.  Packet  priorities 
are  dynamically  updated  at  run-time  to  enable  QoS  re¬ 
quirements  to  be  met.  Essentially  the  packet  priority 
is  scaled  as  a  function  of  the  QoS  that  a  packet  has  ex¬ 
perienced  up  to  that  point  in  time  relative  to  the  QoS 
requested  by  the  packet.  Such  an  update  operation 
has  been  referred  to  as  priority  biasing  [7,  9,  8]  since 
the  priority  value  is  biased  by  the  relative  degrada¬ 
tion  of  its  service.  The  biasing  operation  couples  the 
effect  of  the  scheduler  (e.g.,  queuing  delay)  with  the 
QoS  demand  (e.g.,  jitter  bound).  This  distinguishes 
this  approach  from  priority  update  mechanisms  such 
as  age  counters  that  do  not  distinguish  between  QoS 
requested  by  distinct  connections. 

For  example,  consider  only  constant  bit  rate  (CBR) 
connections  where  QoS  is  measured  by  the  bandwidth 
allocated  to  a  connection.  The  priority  of  a  packet 
can  be  computed  as  the  ratio  of  the  queuing  delay  to 
the  connection’s  inter-arrival  time.  Increasing  queuing 
delay  increases  its  priority  in  successive  scheduling  cy¬ 
cles.  However,  the  rate  at  which  the  priority  increases 
depends  on  the  bandwidth  of  the  connection.  Such  a 
priority  biasing  mechanism  couples  the  ongoing  effect 
of  the  switch  scheduler  (queuing  delay)  with  a  mea¬ 
sure  of  the  demands  made  by  the  application  (connec¬ 
tion  bandwidth).  As  the  negative  impact  of  the  switch 
scheduler grows, so does the priority, effectively
“biasing” it with time. Thus different connections are
biased at different rates, i.e., higher-speed connections
are biased faster.
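
The CBR example can be made concrete with a short sketch. The function and parameter names are hypothetical, and the selection rule (serve the connection whose head packet currently has the largest biased priority) is an assumption made for illustration.

```python
def biased_priority(queuing_delay, inter_arrival):
    """Priority of a CBR packet: queuing delay divided by the
    connection's inter-arrival time (same time unit for both)."""
    if inter_arrival <= 0:
        raise ValueError("inter-arrival time must be positive")
    return queuing_delay / inter_arrival


def pick_next(heads, now):
    """Choose the connection to serve next.

    `heads` maps a connection name to (enqueue_time, inter_arrival)
    for the packet at the head of that connection's queue.
    """
    return max(
        heads,
        key=lambda c: biased_priority(now - heads[c][0], heads[c][1]),
    )
```

With two packets that have both waited 2 time units, a connection with inter-arrival time 1 reaches priority 2.0 while a connection with inter-arrival time 4 reaches only 0.5, which is the sense in which higher-speed connections are biased faster.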

QUIC  explores  a  hardware/software  implementa¬ 
tion  of  a  generalized  priority  biasing  framework.  Our 
hypothesis  is  that  by  customizing  biasing  calculations 
by  stream  and  data  type  and  providing  the  ability  to 
dynamically  control  biasing  calculations  we  can  achieve 
more effective utilization of network resources and
thereby satisfy the QoS requirements of a larger number
of communication requests. Two major issues the
framework  must  address  are:  when  is  the  priority  of  a 
scheme  biased  and  how  is  it  biased.  QUIC  currently 
implements  a  dynamic  window  constrained  scheduling 
algorithm  (DWCS)  that  provides  two  parameters  for 
controlling the “when” and “how” components of
generalized priority biasing. This paradigm is quite
powerful in that it has been shown to be able to model
a range of existing scheduling algorithms [17] while it
also provides for dynamic control of scheduling
parameters, thus providing new avenues for optimizing the
performance of heterogeneous communication streams.
The remainder of this section describes the current
instantiation of DWCS hosted within the network
interfaces.

4.2  QUIC  Quality  Management  -  The 
DWCS  Approach 

The first parameter utilized by DWCS controls the
interval of time between priority adjustments of its
second parameter. The second parameter is simply the
biasing value used to decide which stream has the
highest priority and, hence, which stream should be
scheduled for transmission. Put simply, the biasing value
is dynamically adjusted at specific intervals of time.
Furthermore, DWCS is flexible in that the biasing value
can be adjusted based on the needs of individual
streams. Our current implementation utilizes packet
deadlines and loss-tolerance as the two scheduling
parameters. The motivation for this choice and a detailed
description are provided in the following.

Applications such as real-time media servers need
to  service  hundreds  and,  possibly,  thousands  of  clients, 
each  with  their  own  quality  of  service  (QoS)  require¬ 
ments.  Many  such  clients  can  tolerate  the  loss  of 
a  certain  fraction  of  the  information  requested  from 
the  server,  resulting  in  little  or  no  noticeable  degrada¬ 
tion  in  the  client’s  perceived  quality  of  service  when 
the  information  is  received  and  processed.  Conse¬ 
quently,  loss-rate  is  an  important  performance  mea¬ 
sure  for  the  service  quality  to  many  clients  of  real-time 
media servers. We define the term loss-rate [16, 12] as
the  fraction  of  packets  in  a  stream  either  discarded  or 
serviced  later  than  their  delay  constraints  allow.  How¬ 
ever,  from  a  client’s  point  of  view,  loss-rate  could  be 
the  fraction  of  packets  either  received  late  or  not  re¬ 
ceived  at  all. 

One  of  the  problems  with  using  loss-rate  as  a  perfor¬ 
mance  metric  is  that  it  does  not  describe  when  losses 
are allowed to occur. For most loss-tolerant
applications, there is usually a restriction on the number of
consecutive packet losses that are acceptable. For
example, losing a series of consecutive packets from an
audio stream might result in the loss of a complete
section of audio, rather than merely a reduction in the
signal-to-noise ratio. A suitable performance measure
in this case is a windowed loss-rate, i.e., loss-rate
constrained over a finite range, or window, of consecutive
packets. More precisely, an application might tolerate
x packet losses for every y arrivals at the various
service points across a network. Any service discipline
attempting to meet these requirements must ensure that
the number of violations to the loss-tolerance
specification is minimized (if not zero) across the whole stream.
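
The difference between a plain loss-rate and a windowed loss-rate can be seen in a small sketch. The function below is hypothetical and assumes non-overlapping windows of y consecutive packets; the text does not say whether windows are fixed or sliding, so this is only one possible reading.

```python
def window_violations(outcomes, x, y):
    """Count windows of y consecutive packets with more than x losses.

    `outcomes[k]` is True if packet k was lost or serviced late.
    Windows are taken as non-overlapping blocks of y packets; a
    trailing partial window is ignored.
    """
    if y <= 0 or x < 0:
        raise ValueError("need y > 0 and x >= 0")
    violations = 0
    for start in range(0, len(outcomes) - y + 1, y):
        if sum(outcomes[start:start + y]) > x:
            violations += 1
    return violations
```

Two streams that each lose 2 of 8 packets have the same overall loss-rate, yet with a tolerance of x = 1 per y = 4 the stream whose losses are spread out violates no window, while the stream that loses two packets back to back violates one.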

Some  clients  cannot  tolerate  any  loss  of  information 
received  from  a  server,  but  such  clients  often  require 
delay bounds on the information. Consequently, these
types of clients require deadlines which specify the
maximum amount of time packets of information from the
server  can  be  delayed  until  they  become  invalid.  Fur¬ 
thermore,  some  multimedia  applications  often  require 
jitter,  or  delay  variation,  to  be  minimized.  Such  a  re¬ 
quirement  can  be  satisfied  by  restricting  the  service  for 
an  application  to  commence  no  earlier  than  a  specified 
earliest  time  and  no  later  than  the  deadline  time. 

To  guarantee  such  diverse  QoS  requires  fast  and  ef¬ 
ficient  scheduling  support  at  the  server.  This  section 
describes  the  features  specific  to  a  real-time  packet 
scheduler  resident  on  a  server  (specifically  designed 
to  run  on  either  the  host  processor  or  the  network 
interface  card),  designed  to  meet  service  constraints 
on  information  transferred  across  a  network  to  many 
clients.  Specifically,  we  describe  Dynamic  Window- 
Constrained  Scheduling  (DWCS),  which  is  designed  to 
meet  the  delay  and  loss  constraints  on  packets  from 
multiple  streams  with  different  performance  objectives. 
In  fact,  DWCS  is  designed  to  limit  the  number  of  late 
packets  over  finite  numbers  of  consecutive  packets  in 
loss-tolerant and/or delay-constrained, heterogeneous
traffic  streams. 

4.3  The  DWCS  Scheduler 

DWCS  is  designed  to  maximize  network  bandwidth  us¬ 
age  in  the  presence  of  multiple  packets  each  with  their 
own  service  constraints.  The  algorithm  requires  two 
attributes  per  packet  stream,  as  follows: 

• Deadline - this is the latest time a packet can
commence service. The deadline is determined from a
specification of the maximum allowable time
between servicing consecutive packets in the same
stream (i.e., the maximum inter-packet gap).

• Loss-tolerance - this is specified as a value x_i/y_i,
where x_i is the number of packets that can be lost
or transmitted late for every window, y_i, of
consecutive packet arrivals in the same stream, i. For
every y_i packet arrivals in stream i, a minimum of
y_i - x_i packets must be scheduled on time, while
at most x_i packets can miss their deadlines and
be either dropped or transmitted late, depending
on whether or not the attribute-based QoS for the
stream allows some packets to be lost.

At  any  time,  all  packets  in  the  same  stream  have 
the same loss-tolerance, while each successive packet
in  a  stream  has  a  deadline  that  is  offset  by  a  fixed 
amount  from  its  predecessor.  Using  these  attributes, 
DWCS:  (1)  can  limit  the  number  of  late  packets  over 
finite  numbers  of  consecutive  packets  in  loss-tolerant 
or  delay-constrained,  heterogeneous  traffic  streams,  (2) 
does  not  require  a-priori  knowledge  of  the  worst-case 
loading  from  multiple  streams  to  establish  the  neces¬ 
sary  bandwidth  allocations  to  meet  per-stream  delay 
and  loss-constraints,  (3)  can  safely  drop  late  packets 
in  lossy  streams  without  unnecessarily  transmitting 
them,  thereby  avoiding  unnecessary  bandwidth  con¬ 
sumption,  and  (4)  can  exhibit  both  fairness  and  un¬ 
fairness  properties  when  necessary.  In  fact,  DWCS  can 
perform  fair-bandwidth  allocation,  static  priority  (SP) 
and  earliest-deadline  first  (EDF)  scheduling. 

4.4  DWCS  Algorithm 

Dynamic  Window-Constrained  Scheduling  (DWCS) 
orders  packets  for  transmission  based  on  the  current 
values  of  their  loss-tolerances  and  deadlines.  Prece¬ 
dence  is  given  to  the  packet  at  the  head  of  the  stream 
with the lowest loss-tolerance. Packets in the same
stream  all  have  the  same  original  and  current  loss- 
tolerances,  and  are  scheduled  in  their  order  of  ar¬ 
rival.  Whenever  a  packet  misses  its  deadline,  the  loss- 
tolerance  for  all  packets  in  the  same  stream,  s,  is  ad¬ 
justed  to  reflect  the  increased  importance  of  transmit¬ 
ting  a  packet  from  s.  This  approach  avoids  starv¬ 
ing  the  service  granted  to  a  given  packet  stream,  and 
attempts  to  increase  the  importance  of  servicing  any 
packet  in  a  stream  likely  to  violate  its  original  loss 
constraints.  Conversely,  any  packet  serviced  before  its 
deadline causes the loss-tolerance of other packets (yet
to  be  serviced)  in  the  same  stream  to  be  increased, 
thereby  reducing  their  priority. 

The  loss-tolerance  of  a  packet  (and,  hence,  the  cor¬ 
responding  stream)  changes  over  time,  depending  on 
whether  or  not  another  (earlier)  packet  from  the  same 
stream  has  been  scheduled  for  transmission  by  its  dead¬ 
line.  If  a  packet  cannot  be  scheduled  by  its  deadline,  it 
is either transmitted late (with adjusted loss-tolerance)
or it is dropped and the deadline of the next packet in
the stream is adjusted to compensate for the latest time
it could be transmitted, assuming the dropped packet
was transmitted as late as possible.

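Pairwise Packet Ordering
--------------------------------------------------------
Lowest loss-tolerance first
--------------------------------------------------------
Same non-zero loss-tolerance, order EDF
Same non-zero loss-tolerance & deadlines,
    order lowest loss-numerator first
--------------------------------------------------------
Zero loss-tolerance & denominators,
    order EDF
--------------------------------------------------------
Zero loss-tolerance, order
    highest loss-denominator first
--------------------------------------------------------
All other cases: first-come-first-serve

Table 1: Precedence amongst pairs of packets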

Table  1  shows  the  rules  for  ordering  pairs  of  pack¬ 
ets  in  different  streams.  Recall  that  all  packets  in  the 
same  stream  are  queued  in  their  order  of  arrival.  If  two 
packets have the same non-zero loss-tolerance, they
are ordered earliest-deadline first (EDF) in the same
queue. If two packets have the same non-zero
loss-tolerance and deadline, they are ordered lowest
loss-numerator x_i' first, where x_i'/y_i' is the current
loss-tolerance for all packets in stream i. By ordering on
the lowest loss-numerator, precedence is given to the
packet in the stream with tighter loss constraints, since
fewer consecutive packet losses can be tolerated. If
two packets have zero loss-tolerance and their
loss-denominators are both zero, they are ordered EDF;
otherwise they are ordered highest loss-denominator
first. If it is paramount that a stream never loses more
packets  than  its  loss-tolerance  permits,  then  admission 
control  must  be  used,  to  avoid  accepting  connections 
whose  QoS  constraints  cannot  be  met  due  to  existing 
connections’  service  constraints. 
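
The precedence rules of Table 1 can be written as a pairwise comparison function. The sketch below is one reading of the table and the accompanying text, not the authors' implementation: the Head record and its field names are hypothetical, a zero loss-denominator is treated as a loss-tolerance of zero, and the "zero loss-tolerance & denominators" row is interpreted, following the text, as the case where both denominators are zero.

```python
from dataclasses import dataclass
from fractions import Fraction
from functools import cmp_to_key


@dataclass
class Head:
    """Packet at the head of a stream's queue (hypothetical record)."""
    stream: str
    x: int            # current loss-numerator
    y: int            # current loss-denominator
    deadline: float
    arrival: float


def tolerance(p):
    """Current loss-tolerance x/y; a zero denominator counts as zero."""
    return Fraction(p.x, p.y) if p.y else Fraction(0)


def compare(a, b):
    """Negative if a is serviced before b, positive if after, 0 if tied."""
    ta, tb = tolerance(a), tolerance(b)
    if ta != tb:                                  # lowest loss-tolerance first
        return -1 if ta < tb else 1
    if ta > 0:
        if a.deadline != b.deadline:              # same non-zero tolerance: EDF
            return -1 if a.deadline < b.deadline else 1
        if a.x != b.x:                            # then lowest loss-numerator
            return -1 if a.x < b.x else 1
    elif a.y == 0 and b.y == 0:
        if a.deadline != b.deadline:              # both denominators zero: EDF
            return -1 if a.deadline < b.deadline else 1
    elif a.y != b.y:                              # highest loss-denominator first
        return -1 if a.y > b.y else 1
    if a.arrival != b.arrival:                    # all other cases: FCFS
        return -1 if a.arrival < b.arrival else 1
    return 0


def next_packet(heads):
    """Linear scan over the head packets of all streams."""
    return min(heads, key=cmp_to_key(compare))
```

Selecting the next packet with a single pass over the n head packets matches the linear worst-case selection cost; a priority queue keyed on the same comparison would be an alternative design, at the cost of re-ordering entries whenever loss-tolerances change.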

Every time a packet in stream i is transmitted, the
loss-tolerance of i is adjusted. Likewise, other streams’
loss-tolerances are adjusted only if any of the packets
in those streams miss their deadlines as a result of
queueing delay. Consequently, DWCS requires
worst-case O(n) time to select the next packet for service
from  those  packets  at  the  head  of  n  distinct  streams. 
However,  the  average  case  performance  can  be  far  bet¬ 
ter,  because  not  all  streams  always  need  to  have  their 
loss-tolerances  adjusted  after  a  packet  transmission. 

Loss-Tolerance Adjustment. Loss-tolerances are
adjusted by considering x_i/y_i, which is the original
loss-tolerance for all packets in stream i, and x_i'/y_i',
which is the current loss-tolerance for all queued
packets in stream i. The basic idea of these adjustments
is to adjust loss numerators and denominators for all
buffered packets in the same stream i as the packet
most  recently  transmitted  before  its  deadline.  The  de¬ 
tails  of  these  adjustments  appear  in  [17,  18].  Here,  it 
suffices  to  say  that  DWCS  has  the  ability  to  imple¬ 
ment  a  number  of  real-time  and  non-real-time  policies. 
Moreover,  DWCS  can  act  as  a  fair  queueing,  static  pri¬ 
ority  and  earliest-deadline  first  scheduling  algorithm, 
as  well  as  provide  service  for  a  mix  of  static  and  dy¬ 
namic  priority  traffic  streams. 

4.5  Programmable  Hardware  Support 

The  explosive  growth  in  the  functionality  of  config¬ 
urable  or  programmable  hardware  in  the  form  of  field 
programmable  gate  arrays  (FPGAs)  is  changing  the  ar¬ 
chitecture  of  information  systems  that  deal  with  data 
intensive  computations  [10,  11].  For  example,  we  have 
seen  the  advent  of  configurable  computing  systems 
where  programmable  hardware  is  coupled  with  pro¬ 
grammable  processors.  Hardware/software  co-design 
has  emerged  as  an  associated  design  paradigm  where 
the  programmable  hardware  components  (in  the  form 
of  FPGAs)  and  software  components  (executed  on 
processors)  are  designed  concurrently  with  efficient 
trade-offs  across  the  HW/SW  boundary.  While  this 
paradigm  is  largely  targeted  towards  embedded  com¬ 
puter  systems,  we  can  apply  relevant  concepts  to  the 
design  of  intelligent  network  interfaces. 

FPGA  devices  effectively  represent  hardware  pro¬ 
grammable  alternatives  to  system  level  application  spe¬ 
cific  integrated  circuit  (ASIC)  designs  and  at  a  dras¬ 
tically  reduced  cost.  Modern  devices  can  be  dynam¬ 
ically  and  incrementally  re-programmed  in  microsec¬ 
onds,  have  increased  memory  on  chip,  and  operate  two 
to  four  times  faster  than  current  chips.  FPGA  devices 
perform  particularly  well  on  regular  computations  over 
large  data  sets.  This  technology  naturally  fits  in  archi¬ 
tectures  that  stream  and  operate  on  large  amounts  of 
data, as in the bulk of emerging network-based
multimedia applications. The hardware functionality in the
interfaces can be re-programmed “on the fly” almost
as  if  we  were  swapping  out  custom  devices. 

We  are  motivated  to  include  FPGA  devices  in  the 
NIs  for  two  specific  reasons.  The  first  is  to  host  priority 
biasing  calculations.  Such  calculations  are  inherently 
parallel  with  priorities  being  computed  across  many 
connections.  A  large  number  of  relatively  simple  in¬ 
dependent  computations  can  be  effectively  supported 
within FPGAs. In software, by contrast, the overhead
of such computations can either render NI
implementations of certain biasing calculations infeasible or
greatly reduce the link utilizations that can be achieved.
The second reason is the ability to host certain classes of
data stream computations in the network interface.
For  example,  data  filtering,  encryption,  and  compres¬ 
sion  are  candidates  for  implementation  with  FPGAs 
available  in  the  NI.  Such  computations  can  be  nat¬ 
urally  performed  on  data  streams  during  transmis¬ 
sion  rather  than  via  (relatively)  expensive  traversals 
through  the  memory  hierarchy  to  the  CPU.  The  VCM 
environment  can  provide  access  to  these  hardware  de¬ 
vices  via  extension  modules  that  can  be  used  to  load 
configuration  data  into  the  FPGAs.  Thus,  the  abstrac¬ 
tions  used  to  extend  the  NI  functionality  dynamically 
are  the  same  for  functions  implemented  in  software  or 
programmable  hardware.  Our  goal  is  to  leverage  this 
FPGA  technology  to  enable  more  powerful  yet  (rel¬ 
atively)  inexpensive  network  interfaces  that  can  sub¬ 
stantially  enhance  the  performance  of  network  appli¬ 
cations. 

5  Concluding  Remarks

The  goal  of  the  QUIC  project  is  the  development  of  an 
effective  quality  management  infrastructure  that  can 
service  a  large  number  of  connections,  each  with  dis¬ 
tinct  service  requirements.  Towards  this  end  we  are 
constructing  an  experimental  infrastructure  using  a 
cluster of PCs interconnected by Fast Ethernet using
i960  based  interface  cards.  At  the  center  of  our  efforts 
is  an  extensible  software  and  quality  management  in¬ 
frastructure.  By  placing  quality  management  function¬ 
ality  “close”  to  the  wire  and  I/O  components  attached 
to the I2O boards, we expect to enable QoS guarantees
at  higher  levels  of  resource  utilization  than  commod¬ 
ity  clusters  will  otherwise  permit.  Our  current  efforts 
are  geared  towards  creating  a  rapid  prototyping  en¬ 
vironment  to  provide  a  basis  for  experimentation  and 
investigation. 

References 

[1] K. Diefendorff and P. Dubey. How multimedia
workloads will change processor design. IEEE
Computer, vol. 30, no. 9, pp. 43-45, September
1997.

[2] Richard P. Martin, Amin M. Vahdat, David
E. Culler, and Thomas E. Anderson. Effects of
Communication Latency, Overhead and Bandwidth in a
Cluster Architecture. Proceedings of the 24th Annual
International Symposium on Computer Architecture,
June 1997.

[3] Wilson C. Hsieh, Kirk L. Johnson, M. Frans
Kaashoek, Deborah A. Wallach, and William E.
Weihl. Efficient implementation of high-level
languages on user-level communication architectures.
Proceedings of the 5th ACM SIGPLAN Symposium
on Principles and Practice of Parallel Programming
(PPoPP '95), 1995.

[4] Marcel-Catalin Rosu, Karsten Schwan, and
Richard Fujimoto. Supporting Parallel Applications
on Clusters of Workstations: The Intelligent
Network Interface Approach. Proceedings of the
Sixth IEEE International Symposium on High
Performance Distributed Computing (HPDC-6),
August 1997.

[5]  M.  Rosu,  K.  Schwan,  and  R.  Fujimoto.  Support¬ 
ing  parallel  applications  on  clusters  of  workstations: 
the  virtual  communication  machine-based  architec¬ 
ture.  Cluster  Computing,  pp.  1029,  November  1997. 

[6] C. E. Kozyrakis and D. Patterson. A new direction
for computer architecture research. IEEE Computer,
vol. 31, no. 11, pp. 24-32, November 1998.

[7] A. A. Chien and J. H. Kim. Approaches to quality
of service in high performance networks. Proceedings
of the Workshop on Parallel Computer Routing
and Communication, pp. 1-20, June 1997.

[8]  J.  H.  Kim.  Bandwidth  and  latency  guarantees  in 
low-cost  high  performance  networks.  Ph.D.  Thesis, 
Department  of  Computer  Sciences,  University  of 
Illinois,  Urbana-Champaign,  January  1997. 

[9]  D.  Garcia  and  D.  Watson.  ServerNet  II.  Proceed¬ 
ings  of  the  Workshop  on  Parallel  Computer  Rout¬ 
ing  and  Communication,  pp.  119-136,  June  1997. 

[10] J. Villasenor and W. H. Mangione-Smith. Configurable
computing. Scientific American, pp. 66-71,
June 1997.

[11] W. H. Mangione-Smith, B. Hutchings, D. Andrews,
A. DeHon, C. Ebeling, R. Hartenstein, O.
Mencer, J. Morris, K. Palem, V. K. Prasanna,
H. Spaanenburg. Seeking solutions in configurable
computing. IEEE Computer, vol. 30, no. 12, pp.
38-43, December 1997.

[12] Domenico Ferrari. Client requirements for real-time
communication services. IEEE Communications
Magazine, 28(11):76-90, November 1990.

[13] I2O Special Interest Group.
www.i2osig.org/architecture/techback98.html.

[14] Intel. i960 Rx I/O Microprocessor Developer's
Manual,  April  1997. 


[15]  Intel.  IQ80960RX  Evaluation  Platform  Board 
Manual,  March  1997. 

[16] Jon M. Peha and Fouad A. Tobagi. A cost-based
scheduling algorithm to support integrated
services. In IEEE INFOCOM '91, pages 741-753.
IEEE, 1991.

[17]  Richard  West  and  Karsten  Schwan.  Dynamic 
window-constrained  scheduling  for  multimedia  ap¬ 
plications.  Technical  Report  GIT-CC-98-18,  Geor¬ 
gia  Institute  of  Technology,  1998.  To  appear  in  the 
6th  International  Conference  on  Multimedia  Com¬ 
puting  and  Systems,  ICMCS’99,  Florence,  Italy. 

[18]  Richard  West,  Karsten  Schwan,  and  Christian 
Poellabauer.  Scalable  scheduling  support  for  loss 
and  delay  constrained  media  streams.  Technical 
Report  GIT-CC-98-29,  Georgia  Institute  of  Tech¬ 
nology,  1998. 

[19]  WindRiver  Systems.  VxWorks  Reference  Manual, 
1  edition,  February  1997. 

[20] WindRiver Systems. Writing I2O Device Drivers
in IxWorks, 1 edition, September 1997.




Adaptive  Distributed  Applications  on  Heterogeneous  Networks 

Thomas Gross (1,2), Peter Steenkiste (1), and Jaspal Subhlok (3)

(1) School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
(2) Departement Informatik, ETH Zurich, CH 8092 Zurich
(3) Department of Computer Science, University of Houston, Houston, TX 77204


Abstract 

Distributed  applications  execute  in  environments  that  can 
include  different  network  architectures  as  well  as  a  range 
of  compute  platforms.  Furthermore,  these  resources  are 
shared  by  many  users.  Therefore  these  applications  receive 
varying  levels  of  service  from  the  environment.  Since  the 
availability  of  resources  in  a  networked  environment  often 
determines  overall  application  performance,  adaptivity  is 
necessary  for  efficient  execution  and  predictable  response 
time.  However,  heterogeneous  systems  pose  many  chal¬ 
lenges  for  adaptive  applications.  We  discuss  the  range  of 
situations  that  can  benefit  from  adaptivity  in  the  context 
of  a  set  of  system  and  environment  parameters.  Adaptive 
applications  require  information  about  the  status  of  the  ex¬ 
ecution  environment  and  heterogeneous  environments  call 
for a portable system to provide such information. We discuss
Remos (Resource Monitoring System), a system that
allows  applications  to  collect  information  about  network 
and  host  conditions  across  different  network  architectures. 
Finally,  we  report  our  experience  and  performance  results 
from a set of adaptive versions of the Airshed pollution modeling
application executing on a networking testbed.

1  Introduction 

Many distributed applications have critical response time requirements.
The timeliness of a response, however, depends
on  the  availability  of  resources:  network  bandwidth  to  trans¬ 
fer  information  and  processor  cycles  to  perform  computa¬ 
tions.  In  heterogeneous  environments,  applications  seldom 
have  exclusive  access  to  resources.  Instead,  network  links 
and  processors  are  shared  by  many  applications  and  users. 

The performance of a fast processor or network link can
deteriorate to that of a slow one with additional computation
load, but if the application can move to another system, then
the user may not experience a slowdown. When running a
distributed simulation, the impact of link congestion can be
avoided by migrating to a different part of the network. A
data warehouse may appear to stop operating when additional
users start expensive queries, but if the data is replicated
on another server, the application may switch to this
server and thereby preserve the perception of a timely response.
The transfer of a movie is subject to many dropped
frames if there is network congestion. However, a smart
filter may be able to remove non-essential frames from the
movie and maintain audio and video synchronization by
reducing the bandwidth requirement.

All  of  these  examples  of  adaptivity  have  been  explored 
in  various  systems.  In  this  paper  we  attempt  to  present  a 
structure to these approaches that allows us to unify the development
of interfaces between applications and environments.
Since heterogeneous environments provide many
challenges  to  application  developers,  it  is  important  that 
the  interface  that  provides  network  measurements  is  sim¬ 
ple  and  portable.  We  believe  that  a  uniform  framework  for 
developing  adaptive  applications  and  resource  monitoring 
systems  that  work  across  different  network  architectures  are 
the  essential  ingredients  for  speeding  up  the  development 
of  adaptive  applications. 

The  remainder  of  this  paper  is  organized  as  follows. 
We  first  describe  the  “space”  of  adaptation  options  that  are 
available, using an example scientific simulation to illustrate
the choices. We then give an overview of the Remos system
for collecting and reporting network status, and present
performance  results  for  an  adaptive  environmental  model¬ 
ing  application.  We  conclude  with  a  discussion  of  related 
work. 


2  Adaptivity  of  applications 

Adaptivity  allows  applications  to  run  efficiently  and  pre¬ 
dictably  under  a  broader  range  of  conditions.  Support  for 
adaptation  may  also  allow  applications  to  use  less  expen¬ 
sive  service  classes,  e.g.  best  effort  instead  of  guaranteed 
service.  Some  of  the  functionality  (and  complexity)  asso¬ 
ciated  with  adaptation  can  be  embedded  in  middleware,  but 
we  first  have  to  understand  the  dimensions  of  adaptation 
before we can develop general purpose libraries or middleware
layers to support adaptivity.

Applications  can  adapt  along  a  number  of  “dimensions”. 
In  this  paper  we  focus  on  the  choice  of  resources  (space  di¬ 
mension),  the  time  of  adaptation,  and  the  interface  between 
the  application  and  the  runtime  system  (or  operating  sys¬ 
tem,  i.e.,  the  system  that  is  responsible  for  management  of 
resources).  In  each  case  we  first  sketch  the  full  spectrum 
of  options  available  to  applications  in  general,  and  we  then 
focus  on  the  options  that  are  of  most  interest  to  distributed 
scientific simulations. We use the Airshed environmental
modeling application described in Section 4 to illustrate the
performance tradeoffs associated with adaptation.

2.1  Resource  classes 

Applications  have  access  to  a  wide  range  of  resources,  and 
they  often  have  a  choice  about  how  many  and  which  re¬ 
sources  they  can  use.  An  application  can  be  adaptive  with 
regard  to  the  number  of  processors  or  nodes  that  it  can  use 
or  its  adaptivity  may  be  restricted  to  the  space  of  the  net¬ 
work  environment,  i.e.,  the  number  of  nodes  is  fixed  but  the 
identity  of  the  nodes  is  determined  dynamically. 

Network  resources  are  another  candidate  for  adaptivity. 
Network  bandwidth  can  sometimes  be  traded  off  with  other 
parameters  such  as  the  fidelity  of  the  data  or  the  quality  of 
the  objects  that  are  transferred.  For  example,  by  chang¬ 
ing  the  size  or  the  frame  rate  of  a  movie,  an  application 
can  increase  or  decrease  its  bandwidth  demands.  Alter¬ 
nately,  applications  can  make  tradeoffs  between  different 
types  of  resources,  e.g.,  compression  can  be  used  to  re¬ 
duce  the  bandwidth  requirements,  but  then  CPU  cycles  are 
required  to  compress  and  decompress  the  data. 
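
The bandwidth-for-cycles tradeoff can be made concrete with a rough cost model. The following is only an illustrative sketch: the function names, the fixed compression ratio, and the fixed CPU throughputs are assumptions introduced here and do not come from the paper.

```python
def transfer_time(size_bytes, bandwidth_bps, ratio=1.0,
                  compress_Bps=None, decompress_Bps=None):
    """Estimated seconds to deliver size_bytes over one link.

    ratio is compressed_size / original_size (1.0 means no compression).
    compress_Bps and decompress_Bps are CPU throughputs in bytes/second;
    None means that step is skipped.
    """
    cpu = 0.0
    if compress_Bps:
        cpu += size_bytes / compress_Bps
    if decompress_Bps:
        cpu += size_bytes / decompress_Bps
    wire = size_bytes * ratio * 8 / bandwidth_bps
    return cpu + wire


def should_compress(size_bytes, bandwidth_bps, ratio,
                    compress_Bps, decompress_Bps):
    """True if spending CPU cycles on compression shortens the transfer."""
    plain = transfer_time(size_bytes, bandwidth_bps)
    packed = transfer_time(size_bytes, bandwidth_bps, ratio,
                           compress_Bps, decompress_Bps)
    return packed < plain
```

Under this model compression pays off on a slow link and hurts on a fast one, which is why the choice is best made from current bandwidth measurements rather than fixed in advance.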

Network  resources  are  often  not  directly  accessible  to  an 
application  but  their  use  is  determined  by  the  kind  of  ser¬ 
vice  that  the  application  requests.  Recently,  the  networking 
community  has  been  working  on  developing  integrated  ser¬ 
vices  networks  that  can  offer  a  range  of  services  [4].  The 
service  class  dimension  reflects  the  fact  that  an  application 
can  pick  a  service  class  that  best  matches  its  needs.  This 
decision  may  (should)  be  based  on  dynamic  conditions. 
E.g., when setting up a video conference over a network
that supports differentiated service, the user or the application
would like to pick the lowest service class (best effort
service) that can provide sufficient bandwidth. A higher
service class (e.g., expedited service) will be selected only
if it can deliver bandwidth that a lower service class
cannot.
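
The selection rule for service classes can be sketched in a few lines. This is a minimal illustration, assuming the application already has a bandwidth estimate for each class; the function name and the class names are hypothetical.

```python
def pick_service_class(classes, required_bps):
    """Return the lowest service class that can meet the demand.

    classes: list of (name, available_bps), ordered from the lowest
    (cheapest) class to the highest. Returns None if no class suffices.
    """
    for name, available_bps in classes:
        if available_bps >= required_bps:
            return name
    return None
```

Because the estimates change over time, the same call can return a different class on each invocation, which is the sense in which the decision is based on dynamic conditions.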

Scientific  simulations  can  potentially  use  any  of  the 
above  methods.  The  most  common  form  of  adaptation 
along  the  space  dimension  is  likely  to  be  the  addition  or 
deletion  of  execution  nodes,  as  well  as  migration  to  a  differ¬ 
ent  subnet  for  execution.  Rebalancing  the  load  on  different 
nodes and links can be used as a mechanism to adjust to
the changing network status. Another option for adapting
is  modifying  the  mapping  style  of  the  computation  onto  the 
nodes,  e.g.,  replication  of  data  and  computation  to  elim¬ 
inate  communication.  In  some  cases  application  compo¬ 
nents  can  choose  between  multiple  algorithms  with  differ¬ 
ent  computation  and  communication  requirements,  and  they 
can  switch  from  one  to  the  other  when  network  conditions 
change.  Finally,  scientific  simulations  can  also  adapt  in 
the  service  class  dimension  in  a  variety  of  ways,  although 
relatively  few  networks  today  offer  more  than  one  service 
class. 

2.2  Time  of  adaptation 

Along  the  time  dimension,  at  one  extreme,  applications 
adapt  only  at  compile  time.  E.g.,  the  user  may  hardwire  the 
number  of  processors  (nodes)  into  an  application  by  spec¬ 
ifying  this  number  at  the  time  the  application  is  compiled. 
However,  this  scenario  hardly  qualifies  as  “adaptivity”,  so 
we  will  not  discuss  it  further. 

A  more  flexible  option  is  that  the  program  be  compiled 
for  a  variable  number  of  nodes  and  the  actual  number  of 
nodes  for  execution  is  determined  at  the  time  the  appli¬ 
cation  is  executed.  Adaptation  is  in  general  based  on  the 
assumption  that  recent  past  conditions  are  a  good  predic¬ 
tor  of  near-term  future  conditions,  an  assumption  that  often 
holds.  Dynamic  adaptivity  provides  the  most  flexibility 
but  also  poses  the  biggest  challenges  to  the  application  de¬ 
signer.  The  designer  has  two  options  with  different  bene¬ 
fit/complexity  tradeoffs.  One  option  is  to  limit  adaptation 
to load or start-up time. This option is the easiest one since
the application has not set up any state yet. It has the obvious
drawback that if conditions change during execution,
the  application  will  be  unable  to  adapt  to  those  changes. 
An  example  is  an  application  that  has  a  choice  about  what 
nodes,  and  thus  what  part  of  the  environment,  to  use.  A 
Web  browser  may  be  able  to  choose  from  several  replicated 
servers  or  a  proxy  cache.  An  alternate  model  is  to  allow 
the  application  to  adapt  not  only  at  startup  but  also  at  run¬ 
time.  Such  behavior  is  more  complex  to  implement  since 
it  means  that  the  application  must  be  able  to  reconfigure 
itself.  This  capability  requires  changes  in  the  application 
state  and  compute  environment  and  therefore  typically  does 
not  exist  in  today’s  applications;  it  must  be  added  to  make 
the  application  adaptive. 

Runtime  adaptation  along  the  time  dimension  is  ad¬ 
dressed  by  protocols  such  as  TCP  and  has  also  received 
the  most  attention  from  researchers  studying  network-aware 
applications.  For  applications  that  adapt  dynamically,  we 
can  distinguish  between  applications  that  adapt  periodically 
(e.g., a system that rebalances the load every k units of
time) and systems that include demand- or opportunity-driven
adaptivity. A system may adapt whenever some
performance parameter drops below a threshold or may
opportunistically  attempt  to  utilize  extra  resources  as  they 
become  available. 
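
The three runtime triggers (periodic, demand-driven, opportunity-driven) can be combined in a single check. The sketch below is illustrative only; the function name, its parameters, and the priority order among triggers are assumptions made for this example.

```python
def adaptation_trigger(now, last_adapt, period, throughput, threshold,
                       spare_nodes=0):
    """Return why the application should adapt now, or None.

    demand:      a performance parameter dropped below its threshold
    opportunity: extra resources have become available
    periodic:    at least `period` time units passed since the last adaptation
    """
    if throughput < threshold:
        return "demand"
    if spare_nodes > 0:
        return "opportunity"
    if now - last_adapt >= period:
        return "periodic"
    return None
```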

Distributed  simulations  can  benefit  from  adaptation  both 
at  startup  and  at  runtime.  At  startup,  they  typically  have  to 
decide  on  the  number  of  nodes  to  use  [25]  and  on  the  setting 
of  some  control  parameters,  e.g.,  pipeline  depth  [19].  At 
runtime,  they  can  periodically  re-evaluate  their  options,  and 
adaptation  may  take  the  form  of  migration  of  the  executing 
program  to  a  different  part  of  the  network  or  rebalancing  or 
remapping  of  the  computation  on  the  executing  nodes.  This 
runtime  adaptation  adds  considerable  complexity  to  the  pro¬ 
gram  development  and  adaptation  process,  but  is  essential 
to  get  good  performance  for  long  running  applications  in 
dynamically  changing  conditions. 

2.3  Information  about  the  environment 

To  adapt,  applications  need  information  on  the  status  of 
the  environment.  Traditionally,  for  network  resources  this 
task  has  been  performed  by  communication  protocols,  such 
as  TCP,  so  it  is  worthwhile  to  look  at  how  these  protocols 
collect  information  about  network  conditions.  Protocols 
are  often  classified  as  using  implicit  or  explicit  feedback 
from  the  network.  In  protocols  based  on  implicit  feedback, 
the  receiver  monitors  the  incoming  data  stream  and  uses 
this  stream  to  derive  information  about  network  conditions. 
TCP  is  a  good  example:  dropped  packets  are  viewed  as  a 
sign  of  congestion,  and  the  sender  responds  by  reducing  its 
rate.  In  contrast,  with  explicit  feedback,  some  entity  inside 
the  network  provides  explicit  information  about  network 
conditions  to  senders.  A  good  example  is  the  ATM  ABR 
traffic  class:  senders  receive  periodic  information  about 
network  congestion  conditions  (e.g.,  congestion  bit  in  rate 
management  cells)  or  even  the  specific  maximum  traffic 
rate they are allowed to use (e.g., EPRCA).
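
The implicit-feedback style can be illustrated with a simplified additive-increase, multiplicative-decrease rule. This is a sketch in the spirit of TCP's reaction to loss, not TCP's actual congestion control algorithm; the function names and constants are assumptions.

```python
def next_rate(rate, loss_seen, increase=1.0, decrease=0.5, floor=1.0):
    """One adaptation step driven only by observed packet loss."""
    if loss_seen:
        # Loss is interpreted as congestion: back off multiplicatively.
        return max(floor, rate * decrease)
    # No loss observed: probe for more bandwidth additively.
    return rate + increase


def run(loss_pattern, rate=10.0):
    """Apply next_rate over a sequence of loss observations."""
    history = []
    for loss in loss_pattern:
        rate = next_rate(rate, loss)
        history.append(rate)
    return history
```

Note that the sender never learns the available bandwidth directly; it only sees the consequences of its own sending rate.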

Implicit feedback has the advantage that it does not require
support from the network, so this approach to providing
feedback is always feasible*. Implicit feedback also has
some  disadvantages:  (i)  it  only  allows  incremental  adapta¬ 
tion  (i.e.,  when  two  hosts  communicate,  implicit  informa¬ 
tion  provides  updates  on  how  the  bandwidth  between  these 
hosts  evolves),  and  (ii)  it  is  sometimes  difficult  to  interpret 
the  “information”.  (E.g.,  packet  loss  is  an  indication  of  con¬ 
gestion,  but  it  is  not  always  clear  how  the  application  should 
respond:  pause  for  the  duration  of  a  round-trip  time,  reduce 
the  congestion  window,  retransmit  packets,  etc.)  Explicit 
feedback  is  in  general  easier  to  use,  but  it  requires  network 
support.  Protocols  today  primarily  rely  on  implicit  feed¬ 
back,  and  the  same  is  true  for  most  current  network-aware 
applications. The reason is simple: implicit feedback does
not require networking support, which does not exist.

* Implicit feedback typically makes some assumptions about the network,
e.g., it considers packet loss to be a sign of congestion.

Explicit  information  can  be  provided  to  the  application 
in  two  ways.  First,  the  network  can  provide  feedback  con¬ 
tinuously.  This  approach  is,  e.g.,  employed  for  ABR  traf¬ 
fic:  a  rate  management  cell  is  exchanged  with  the  network 
for  every  32  data  cells.  Continuous  feedback  is  most  of¬ 
ten  based  on  network  properties  that  subsequently  must  be 
interpreted  in  the  application  space;  for  this  reason  this 
kind  of  coupling  is  also  called  indirect.  An  alternative  is 
that  applications  receive  a  notification  when  specific  events 
happen, e.g., an application receives an asynchronous notification
when the network bandwidth drops below a certain
threshold, or when the connection switches from one type
of network to another [20]. Such notifications can be delivered
as callbacks or by invoking a specific event handler.
With  this  style  of  interaction,  the  relationship  between  the 
network  event  (e.g.,  drop  in  bandwidth)  and  the  actions  (by 
the  application  or  protocol  software)  is  clearly  established 
(e.g.,  when  registering  the  handler).  Therefore  we  call  this 
style  event-driven  or  direct  coupling. 
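
Direct (event-driven) coupling can be sketched as a small registry of threshold handlers that is fed with continuous bandwidth samples. The class and method names below are hypothetical and are not part of any interface described in the paper.

```python
class BandwidthMonitor:
    """Turns a stream of bandwidth samples into threshold events."""

    def __init__(self):
        self._handlers = []   # list of (threshold_bps, callback)
        self._last = None     # most recent sample, None before the first

    def on_drop_below(self, threshold_bps, callback):
        """Register a handler; the event/action link is fixed here."""
        self._handlers.append((threshold_bps, callback))

    def report(self, bandwidth_bps):
        """Feed one sample; fire handlers only on a downward crossing."""
        previous = self._last
        self._last = bandwidth_bps
        for threshold, callback in self._handlers:
            was_above = previous is None or previous >= threshold
            if was_above and bandwidth_bps < threshold:
                callback(bandwidth_bps)
```

A handler fires once per crossing rather than on every low sample, so the application is not flooded with notifications while the bandwidth stays low.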

Scientific  simulations  can  obtain  network  status  infor¬ 
mation  externally  by  using  a  tool  to  measure  the  activity 
on  the  network  or  internally  by  measuring  the  progress  of 
work  on  different  nodes  and  different  parts  of  the  network. 

Simulations  often  use  load  balancing  to  improve  the  per¬ 
formance  by  giving  less  work  to  the  nodes  that  are  running 
slower  than  others.  Load  balancing  can  be  implemented 
fairly  well  by  internal  measurements  as  the  rate  at  which 
the  work  is  progressing  is  a  good  indicator  of  the  availabil¬ 
ity  of  resources.  The  general  adaptation  model,  in  which 
the  application  monitors  its  own  performance  and  adapts 
when  it  observes  a  degradation  (e.g.  data  loss),  is  widely 
applicable.  It  is  possible  to  provide  support  for  this  form 
of  adaptation  through  the  use  of  frameworks  [2]  or  other 
adaptation  models  [20]. 
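
Load balancing from internal measurements can be sketched as follows: each node's share of the work is made proportional to the rate at which it has recently been completing work. The function name and the rounding policy are assumptions for illustration, and the sketch assumes at least one node reports a positive rate.

```python
def rebalance(total_items, rates):
    """Split total_items across nodes in proportion to measured rates.

    rates: {node: work items completed per second, measured internally}.
    Returns {node: number of items to assign}.
    """
    total_rate = sum(rates.values())
    shares = {node: int(total_items * r / total_rate)
              for node, r in rates.items()}
    # Rounding down may leave a few items over; give them to the
    # fastest nodes first.
    leftover = total_items - sum(shares.values())
    for node in sorted(rates, key=rates.get, reverse=True)[:leftover]:
        shares[node] += 1
    return shares
```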

However,  other  forms  and  dimensions  of  adaptation  can¬ 
not  be  satisfied  without  external  measurements.  Selection 
of nodes at the start of execution must be made with
external  measurements,  as  only  those  data  points  may  be 
available  at  the  time  of  invocation.  Dynamic  migration  to 
a  new  set  of  nodes,  as  well  as  addition  of  nodes  during 
execution,  also  requires  external  information,  as  internal 
information  is  limited  to  the  current  executing  nodes. 

Finally,  it  is  useful  to  distinguish  between  the  interface 
used  by  the  application  and  the  functionality  supported  by 
the  network,  since  it  is  possible  for  a  library  or  middle¬ 
ware  layer  to  translate  one  interface  into  another.  E.g.,  a 
library  could  translate  continuous  network  feedback  into 
event-based  application  feedback.  A  more  interesting  ap¬ 
proach  includes  activities  by  the  middleware:  middleware 
could  use  a  set  of  benchmarks  to  collect  information  on 
the  conditions  in  a  network  (that  does  not  provide  explicit 
feedback) and present this information to the application in
an explicit form. In this paper, we focus on the application
level  interface,  and  we  only  touch  briefly  on  the  lower  level 
interface  when  we  discuss  implementation  options. 

2.4  Network-application  interactions 

To  be  able  to  adapt  along  all  three  dimensions,  applications 
will  need  information  on  network  conditions,  which  span 
a  matching  space  with  the  same  dimensions.  How  appli¬ 
cations  collect  network  information  determines  how  easily 
applications  can  explore  this  space. 

Implicit  information  is  based  on  experience,  which 
severely  restricts  what  part  of  the  information  space  is  ex¬ 
plored:  the  application  only  learns  about  the  part  of  the 
space  it  currently  operates  in.  This  means  that  it  can  collect 
information  only  on  the  network  conditions  along  the  paths 
it  is  currently  using  and  on  the  service  class  it  is  in.  Implicit 
feedback  provides  information  along  the  time  dimension, 
but  only  while  the  application  is  actively  using  the  network. 
At  startup  or  after  the  application  has  been  idle  for  a  while, 
no  useful  information  is  available. 

We  argue  that  given  the  limits  on  what  information  can 
be  collected  using  implicit  feedback,  mechanisms  must  be 
provided  so  that  applications  can  get  explicit  information 
on  network  conditions,  e.g.,  by  querying  a  standard  inter¬ 
face.  Such  an  interface  should  allow  applications  to  collect 
information  on  network  conditions  in  the  entire  network 
space  (space,  time  and  service  class  dimensions),  allowing 
applications  to  make  adaptation  decisions  in  space,  across 
service  classes,  and  at  startup. 

One can argue that the restrictions on the type of information
that can be collected using implicit feedback are
not fundamental. Applications can use probing to explore
the  entire  information  space,  e.g.,  they  can  periodically  try 
all  service  classes  and  they  can  measure  the  network  per¬ 
formance  between  every  pair  of  usable  hosts.  While  this 
approach  may  be  appropriate  in  some  cases,  it  is  in  general 
undesirable.  First,  developing  effective  network  probing 
routines  is  difficult;  it  is  not  something  that  application  de¬ 
velopers  should  be  required  to  do.  Second,  probing  can  be 
expensive,  both  in  terms  of  elapsed  time  for  the  application 
and  consumed  network  resources.  Furthermore,  excessive 
probing  may  disturb  the  measurements  taken  by  this  and 
other  applications.  In  fact,  large  scale  probing  by  appli¬ 
cations  would  negate  many  of  the  advantages  of  implicit 
feedback.  If  probing  is  needed,  it  should  be  performed  by 
a  middleware  layer.  Then  the  probing  code  can  be  devel¬ 
oped  as  part  of  the  network  architecture,  and  the  collected 
information  can  be  shared  by  many  applications. 
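
A quick count shows why exhaustive probing by each application scales poorly. The helper below is a hypothetical illustration: it counts the measurements needed to cover every host pair in every service class once.

```python
def probe_count(hosts, service_classes, directed=True):
    """Number of probes needed for one full sweep of the information space."""
    pairs = hosts * (hosts - 1)       # ordered (source, destination) pairs
    if not directed:
        pairs //= 2                   # one probe per unordered pair
    return pairs * service_classes
```

The count grows quadratically with the number of hosts and is paid again by every application that probes on its own, whereas a shared middleware layer pays it once.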

Note  that  we  are  proposing  explicit  feedback  as  a  com¬ 
plementary  mechanism  to  implicit  feedback,  and  not  as  a 
replacement.  Implicit  feedback  has  clear  advantages  when 
used appropriately. Implicit feedback will remain useful as
an inexpensive way of getting continuous feedback, once
a particular operating point along the space and service dimensions
has been selected. Implicit feedback is also likely
to  give  more  accurate  and  timely  information  (in  a  narrower 
part  of  the  operating  space)  than  explicit  feedback. 

3  A  base  system:  Remos 

The Remos API provides a query-based interface that allows
clients to obtain “best-effort” information [14] on network
conditions. The application specifies the kind of
information  it  needs,  and  Remos  supplies  the  best  available 
information.  To  limit  the  scope  of  the  query,  the  application 
must  select  network  parameters  and  parts  of  a  larger  net¬ 
work  that  are  of  interest.  In  this  section  we  briefly  describe 
the  main  Remos  features.  A  more  detailed  description  can 
be  found  elsewhere  [14]. 

3.1  Level  of  abstraction 

To  accommodate  the  diverse  application  needs,  the  Remos 
API  provides  two  levels  of  abstraction:  high  level  flow- 
based  queries  and  lower  level  topology-based  queries. 

Remos  supports  flow-based  queries.  A  flow  is  an 
application-level  connection  between  a  pair  of  computa¬ 
tion  nodes.  Queries  about  bandwidth  and  latency  of  sets  of 
flows  form  the  core  of  the  Remos  interface.  Using  flows 
instead  of  physical  links  provides  a  high  level  of  abstraction 
that  makes  the  interface  portable  and  independent  of  system 
details.  Flow-based  queries  place  the  burden  of  translating 
network-specific  information  into  application-oriented  in¬ 
formation  on  the  implementor  of  the  API.  However,  flows 
are  an  intuitive  abstraction  for  application  developers,  and 
they  allow  the  development  of  adaptive  network  applica¬ 
tions  that  are  independent  of  the  heterogeneity  inherent  in 
a  network  computing  environment. 

Remos  also  supports  queries  about  the  network  topology. 
The  reason  we  expose  a  network-level  view  of  connectiv¬ 
ity  is  that  certain  types  of  questions  are  more  easily  or 
more  efficiently  answered  based  on  topology  information. 
E.g.,  finding  the  pair  of  nodes  with  the  highest  bandwidth 
connectivity  is  expensive  using  only  flow-based  queries. 
The  topology  information  provided  by  Remos  consists  of 
a  graph  with  compute  nodes,  network  nodes,  and  links, 
each  annotated  with  their  physical  characteristics,  such  as 
latency  and  available  bandwidth.  Topology  queries  return  a 
logical  interconnection  topology.  This  means  that  the  graph 
represents  the  network  behavior  as  seen  by  the  application, 
and  does  not  necessarily  reflect  the  physical  topology.  Us¬ 
ing  a  logical  topology  gives  Remos  the  option  of  hiding 
network  features  that  do  not  affect  the  application.  E.g., 
subnets can be replaced by (logical) links if their internal
structure does not affect applications. Topology information
is in general harder to use than flow-based information,
since  the  complexity  of  translating  network-level  data  into 
application-level  information  is  mostly  left  to  the  user. 
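
To illustrate why some questions are easier to answer from topology information, the sketch below finds the pair of compute nodes with the highest bandwidth connectivity in a logical topology graph. The graph encoding and function names are hypothetical and do not reflect the actual Remos API. The sketch also assumes that traffic between two nodes can use the path with the largest bottleneck bandwidth, which holds trivially when the logical topology is a tree.

```python
from itertools import combinations


def bottleneck_bandwidth(nodes, links):
    """best[u][v] = largest bottleneck bandwidth over any u-v path.

    links: {(u, v): available_bps}, treated as undirected.
    Uses a Floyd-Warshall style pass with (max, min) in place of (min, +).
    """
    best = {u: {v: 0.0 for v in nodes} for u in nodes}
    for u in nodes:
        best[u][u] = float("inf")
    for (u, v), bw in links.items():
        best[u][v] = max(best[u][v], bw)
        best[v][u] = max(best[v][u], bw)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                via_k = min(best[i][k], best[k][j])
                if via_k > best[i][j]:
                    best[i][j] = via_k
    return best


def best_pair(compute_nodes, nodes, links):
    """Pair of compute nodes with the highest bottleneck bandwidth."""
    best = bottleneck_bandwidth(nodes, links)
    return max(combinations(compute_nodes, 2),
               key=lambda pair: best[pair[0]][pair[1]])
```

One topology query supplies the whole graph, whereas a flow-based approach would need a separate query for every candidate pair.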

3.2  Dynamic  resource  sharing 

Since  networks  are  a  shared  resource,  it  is  important  to 
account  for  the  manner  in  which  resources  are  shared  by 
multiple  flows.  Since  multi-party  applications  use  multiple 
flows,  it  is  not  only  necessary  to  account  for  sharing  across 
applications,  but  also  across  flows  belonging  to  the  same 
application.  To  be  able  to  consider  the  effects  of  "internal" 
sharing,  Remos  supports  multi-flow  queries  in  which  the 
application  lists  all  its  flows  simultaneously.  Applications 
can  generate  flows  with  very  diverse  sharing  characteristics, 
ranging  from  constrained  low-bandwidth  audio  to  bursty 
high-bandwidth  data  flows.  Remos  collapses  this  broad 
spectrum  into  three  types  of  flows.  Fixed  flows  have  a  spe¬ 
cific  bandwidth  requirement.  Variable  flows  have  related 
requirements  and  demand  the  maximum  available  band¬ 
width  that  can  be  provided  to  all  such  flows  in  a  given  ratio. 
(E.g.,  all  flows  in  a  typical  all-to-all  communication  oper¬ 
ation  have  the  same  requirements.)  Finally,  independent 
flows  simply  want  maximum  available  bandwidth.  These 
flow  types  also  reflect  priorities  when  sufficient  resources 
are  not  available  to  satisfy  all  the  flows.  Fixed  flows  are  con¬ 
sidered  first,  followed  by  variable  flows,  then  independent 
flows. 

Determining  how  the  throughput  of  a  flow  is  affected 
by  other  messages  in  transit  is  very  complicated  and  net¬ 
work  specific.  Remos  approximates  this  complex  behavior 
by  assuming  that,  all  else  being  equal,  the  bottleneck  link 
bandwidth  is  shared  equally  by  all  flows  (that  are  not  bot¬ 
tlenecked  elsewhere).  If  other  information  is  available,  Re¬ 
mos  can  use  different  sharing  policies  when  estimating  flow 
bandwidths.  The  basic  sharing  policy  assumed  by  Remos 
corresponds  to  the  max-min  fair  share  policy  [11].  Ap¬ 
plications  that  use  topology-based  queries  are  themselves 
responsible  for  taking  the  effects  of  both  internal  and  exter¬ 
nal  sharing  into  account. 
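
The max-min fair share policy can be computed by progressive filling: repeatedly find the most constrained link, freeze the flows crossing it at that link's equal share, and continue with the remaining capacity. The following is a minimal sketch of that general technique, not the Remos implementation; it ignores the fixed and variable flow types and assumes every flow crosses at least one link.

```python
def max_min_share(capacity, flows):
    """Max-min fair bandwidth allocation by progressive filling.

    capacity: {link: available_bps}
    flows:    {flow: [links on its path]}
    Returns   {flow: allocated_bps}
    """
    remaining = dict(capacity)
    active = {f: list(path) for f, path in flows.items()}
    alloc = {}
    while active:
        # Equal share each link could still give its unfrozen flows.
        share = {}
        for link in remaining:
            n = sum(1 for path in active.values() if link in path)
            if n:
                share[link] = remaining[link] / n
        bottleneck = min(share, key=share.get)
        rate = share[bottleneck]
        # Flows through the bottleneck cannot get more than `rate`.
        for f in [f for f, path in active.items() if bottleneck in path]:
            alloc[f] = rate
            for link in active[f]:
                remaining[link] -= rate
            del active[f]
    return alloc
```

In the test case used to check this sketch, a flow that shares a 10-unit link with a flow bottlenecked elsewhere at 2 units receives the remaining 8 units, matching the rule that bandwidth is shared equally only among flows that are not bottlenecked elsewhere.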

3.3  Accuracy 

Applications  ideally  want  information  about  the  level  of 
service  they  can  expect  to  receive  in  the  future,  but  most 
users  today  must  use  past  performance  as  a  predictor  of 
the  future.  Different  applications  are  also  interested  in 
activities  on  different  timescales.  A  synchronous  parallel 
application  expects  to  transfer  bursts  of  data  in  short  periods 
of  time,  while  a  long  running  data  intensive  application 
may  be  interested  in  throughput  over  an  extended  period 
of time. For this reason, relevant queries in the Remos
interface accept a timeframe parameter that allows the user
to  request  data  collected  and  averaged  for  a  specific  time 
window. 

Network  information  such  as  available  bandwidth 
changes  continuously  due  to  sharing  and  as  a  result,  charac¬ 
terizing  these  metrics  by  a  single  number  can  be  misleading. 
E.g.,  knowing  that  the  bandwidth  availability  has  been  very 
stable  represents  a  different  scenario  from  it  being  an  av¬ 
erage  of  rapidly  changing  instantaneous  bandwidths.  To 
address  these  aspects,  the  Remos  interface  adds  statistical 
variability  and  estimation  accuracy  parameters  to  all  dy¬ 
namic  quantitative  information.  Since  the  actual  distribu¬ 
tions  for  the  measured  quantities  are  generally  not  known, 
we present the variability of network parameters using
quartiles [12].
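
A quartile summary over a time window can be sketched with the Python standard library. The function name, the sample format, and the choice of the "inclusive" quantile method are assumptions for illustration; the sketch needs at least two samples inside the window.

```python
from statistics import quantiles


def summarize(samples, now, timeframe):
    """Quartile summary of bandwidth over the last `timeframe` seconds.

    samples: list of (timestamp_seconds, bandwidth_bps).
    """
    window = [bw for t, bw in samples if now - timeframe <= t <= now]
    q1, median, q3 = quantiles(window, n=4, method="inclusive")
    return {"min": min(window), "q1": q1, "median": median,
            "q3": q3, "max": max(window)}
```

A narrow gap between the first and third quartiles indicates stable availability, while a wide gap signals that the median is an average over rapidly changing conditions.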


3.4  Implementation 

An initial version of the Remos API has been implemented. It has two
components, a collector and a modeler, that are responsible for
network-oriented and application-oriented functionality,
respectively. The collector is responsible for collecting low-level
network information. Such data can be collected in many ways, e.g.,
one can periodically run benchmarks that probe the network for
available bandwidth or rely on information gathered by applications
[21]. Our current implementation uses a third method: the collector
explicitly queries routers using SNMP [3] for both topology and
dynamic bandwidth information. The use of SNMP to obtain information
about the state of a network is a standard way of monitoring
networks, and it should allow us to collect detailed information in a
relatively non-intrusive way on a broad set of networks. The modeler
is a library that is linked with the application; it translates the
information provided by the collector into a logical topology graph
or per-flow data in response to application requests. The
modeler-collector architecture is in part motivated by the need to
support scalability and network heterogeneity. In large networked
environments, multiple collectors may have to be deployed, and each
collector can collect information in a way that is most appropriate
for the network it is responsible for. Work is in progress on
implementing collectors that use sources of network information
other than SNMP, e.g., active measurements.
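
The paper does not say which SNMP objects the collector reads. One common way to derive link utilization from SNMP is to sample an interface octet counter twice and divide the difference by the sampling interval. The sketch below assumes that approach and 32-bit counters that wrap around; the function names are hypothetical, and no SNMP library is used because the counter values are passed in directly.

```python
COUNTER32_MODULUS = 2 ** 32  # 32-bit interface octet counters wrap at 2^32


def octet_delta(earlier, later):
    """Octets transferred between two counter readings, allowing one wrap."""
    return (later - earlier) % COUNTER32_MODULUS


def available_bandwidth_bps(capacity_bps, octets_t0, octets_t1, interval_s):
    """Estimate unused link bandwidth from two octet-counter samples.

    capacity_bps: nominal link capacity in bits per second.
    octets_t0, octets_t1: counter readings taken interval_s seconds apart.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    used_bps = octet_delta(octets_t0, octets_t1) * 8 / interval_s
    return max(0.0, capacity_bps - used_bps)


if __name__ == "__main__":
    # A 100 Mbps link that carried 62.5 MB in 10 s is half utilized.
    print(available_bandwidth_bps(100e6, 1_000_000, 63_500_000, 10.0))
```

The sampling interval must be short enough that the counter wraps at most once between readings, otherwise the estimate is too low.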

For the results presented in this paper we used the Remos interface
on a dedicated IP-based testbed at Carnegie Mellon University that
is illustrated in Figure 2.

[Figure 1 content recovered from the original layout. Dimension
labels: Indirect (continuous); Event-driven (direct). Example
entries: queries to network; rate management cells; counting lost
packets; handlers to react to changes in the network; reacting to
lost retransmissions.]

Figure 1: Examples for two dimensions of application/network coupling.


Links: 100 Mbps point-to-point Ethernet

Endpoints: DEC Alpha systems (manchester-*, labeled m-*)

Routers: Pentium Pro PCs running NetBSD (aspen, timberline, whiteface)


Figure  2:  Testbed  used  for  Airshed  experiments 


4  Case  study  in  adaptive  execution: 
Airshed  pollution  modeling 

We have built a suite of tools for developing adaptive distributed
programs driven by Remos and have gained experience with programs
ranging from small kernels like fast Fourier transforms to complete
applications like Airshed pollution modeling and magnetic resonance
imaging [5]. In this paper, we focus exclusively on the Airshed
pollution modeling application and present results comparing the
performance of the basic implementation with various adaptive
versions.

The Airshed application [15] models the formation, reaction, and
transport of atmospheric pollutants and related chemical species. We
implemented a distributed version of Airshed using Fx data
parallelism [8]. Data parallelism in Fx is similar to High
Performance Fortran [9], so these observations apply to other
applications as well. An adaptive version of Airshed was developed
using integrated task and data parallelism in Fx [24]. For efficient
execution, this application involves significant communication in
the form of array redistributions since the various chemistry and
transport phases access the main particle array along different
dimensions. The details of the Airshed implementation are described
in [23].

We executed Airshed using a tool that automatically selects the best
nodes for execution based on the network information provided by
Remos. The details of this node selection procedure and its
validation are discussed in [22]. Table 1 presents the results
obtained on our networking testbed, which consists of a number of
DEC Alpha workstations connected via three routers. The testbed
allows us to configure the bandwidth between the routers, as well as
to apply various traffic patterns for controlled experiments.
Figure 2 shows the set-up.

We observe that automatic node selection has little impact on
performance in the absence of network traffic. In the presence of a
fixed traffic stream that saturates one of the communication links,
automatic node selection more than halves the execution time. The
reason is that Airshed is an SPMD application with a significant
communication component, and saturation of a single link creates a
bottleneck that slows down the entire computation. However, it is
possible to select a set of nodes automatically using Remos
information such that the busy links are avoided for program
communication. The last two columns in the table show performance on
the network with load generators that simulate moderate utilization
of network resources. We observe that the performance is
considerably enhanced with automatic node selection as the node
selection succeeds in avoiding congested links and busy processors
in many cases. However, such enhancements are not always possible
when the network is heavily used, and hence the performance
advantage is not to the extent observed for a single congested link.
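
The size of these effects can be checked directly from the execution times reported in Table 1. The short sketch below computes the ratio of random to automatic node selection for each network condition; the dictionary keys are shortened labels chosen here for illustration.

```python
# Execution times from Table 1 as (random selection, automatic selection).
TABLE_1 = {
    "no traffic": (652, 650),
    "fixed traffic": (1726, 674),
    "dynamic traffic": (1125, 754),
    "dynamic traffic and load": (2121, 1420),
}


def speedups(table):
    """Return the ratio random/automatic for every network condition."""
    return {name: rnd / auto for name, (rnd, auto) in table.items()}


if __name__ == "__main__":
    for name, ratio in speedups(TABLE_1).items():
        print(f"{name}: {ratio:.2f}x")
```

The ratio is close to 1 without traffic, about 2.56 with a single saturated link, and about 1.49 in both dynamically varying cases, which matches the qualitative discussion above.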

The results highlight the importance of simple adaptation in the
resource space dimension at start-up time, and demonstrate that a
toolset based on external measurements can effectively drive such
adaptation. Note that internal measurements made by a program are
not of any use in deciding which nodes the application should be
started on. The main drawback of adaptation only at start-up time is
that network conditions change over time, and hence adaptation
during execution is important for long-running applications.

                 Execution time with external load and traffic
Node         No Network   Fixed Network   Dynamically       Dynamically varying
Selection    Traffic      Traffic         varying Traffic   Traffic and Load

Random       652          1726            1125              2121
Automatic    650          674             754               1420

Table 1: Performance results of 5-node Airshed in different network conditions. Execution times using automatic node
selection are compared with those obtained with random node selection. For the case of dynamically varying traffic, only 1/4
of the Airshed processing was done in one invocation and the results shown are scaled up for comparison.


Table 2 presents preliminary results from a dynamic adaptive version
of Airshed. This version queries Remos for network status after
every major simulation step, and migrates to a new set of nodes if
the current set of nodes or links becomes considerably busier than
other parts of the network. For the purpose of comparison, we
created an adaptive version that would not actually migrate (but had
adaptation support built into it) and compared it to the actual
migrating adaptive application, under different network conditions.
Both were started on the same set of nodes, and a fixed traffic
pattern was maintained for the duration of the experiment. The
interfering and non-interfering patterns are relative to the set of
nodes and links on which the application was started.
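
The migrate-or-stay decision taken after each simulation step can be sketched as follows. This is only an illustration under stated assumptions: the per-node "busyness" metric, the threshold factor of 1.5, and the function name are hypothetical, and the actual decision procedure used for Airshed is not specified here.

```python
def choose_nodes(busyness, current, k, threshold=1.5):
    """Pick the node set for the next simulation step.

    busyness:  dict node -> non-negative load estimate (assumed metric,
               e.g. derived from externally measured link utilization).
    current:   nodes the application is running on now.
    k:         number of nodes the application needs.
    threshold: migrate only if the current set is this many times busier
               than the best alternative (assumed value).
    """
    # Candidate set: the k least busy nodes, ties broken by name.
    candidate = sorted(busyness, key=lambda n: (busyness[n], n))[:k]
    current_cost = max(busyness[n] for n in current)
    candidate_cost = max(busyness[n] for n in candidate)
    if current_cost > threshold * candidate_cost:
        return candidate       # current set is considerably busier: migrate
    return list(current)       # difference too small: stay put


if __name__ == "__main__":
    load = {"m-1": 0.9, "m-2": 0.1, "m-3": 0.2, "m-4": 0.1}
    print(choose_nodes(load, ["m-1", "m-2"], k=2))
```

The threshold acts as a guard against migrating for small differences, which is one way to limit the cost of unnecessary migration.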

We first observe that both versions run slower than the non-adaptive
versions of Airshed discussed earlier, even in the absence of any
load or traffic. This observation points out the fixed overhead of
adaptation support. In the absence of interfering network traffic,
the migrating version executes slightly slower than the static
version. This difference reflects two types of overhead associated
with migration. First, there is a cost associated with analyzing and
deciding the best nodes for execution. Second, there is a cost
associated with unnecessary migration, which can happen because of
the heuristic nature of adaptation decisions. Finally, we observe
that the adaptive version performs significantly better in the
presence of interfering traffic. The general conclusion is that
support for adaptation entails moderate overheads, but it can
minimize the impact of external traffic on execution times.

This experiment highlights the importance of runtime adaptation and
demonstrates that an external tool like Remos is effective in
driving the dynamic adaptation process. Note that runtime migration
cannot be done with internal application measurements, since they
are not available for the nodes that the application is not
currently executing on. However, internal measurements can be used
for load balancing, which is an example of an application modifying
its demands in response to changes in the resource availability. It
is clear that adaptation by migration exploits a degree of freedom
in the resource dimension that is not available to load balancers.


5  Related  work 

An  important  contribution  of  our  research  is  to  provide  a 
structure  to  adaptivity  options,  especially  in  the  context  of 
distributed  simulations.  We  are  not  aware  of  any  work  that 
specifically  addresses  this  aspect.  However,  a  number  of 
projects  address  measurement  and  management  of  network 
resources  that  complement  the  Remos  system  discussed 
in  the  paper.  We  also  discuss  some  other  approaches  to 
adaptivity  reported  in  the  literature. 

5.1  Network  resource  management  and  measurements

A number of resource management systems allow applications to make
queries about the availability of computation resources, some
examples being Condor [13] and LSF (Load Sharing Facility). Resource
management for large-scale, Internet-wide computing is an important
area of current research, and some well-known efforts are Globus [6]
and Legion [7]. These systems provide support for a wide range of
functions such as resource location and reservation, authentication,
and remote process creation mechanisms. Recent systems that focus on
measurements of communication resources across Internet-wide
networks include the Network Weather Service (NWS) [26] and
topology-d [16]. NWS makes resource measurements to predict future
resource availability, while topology-d computes the logical
topology of a set of Internet nodes. Both these systems actively
send messages to make communication measurements between pairs of
computation nodes. A number of sites are collecting Internet traffic
statistics, e.g., [1]. This information is not in a form that is
usable for applications, and it is typically also at a coarser grain
than what applications are interested in using. Another class of
related research is the collection and use of application-specific
performance data, e.g., a Web browser that collects information
about the response times of different sites [21]. Related work also
addresses estimating stochastic values [18] that represent varying
quantities on networks.

Table 2: Execution times of adaptive version of Airshed executing on a fixed set of nodes and on dynamically selected nodes

In comparison with some of these projects, the Remos interface
focuses on providing good abstractions and support for
application-level access to network status information and allows
for a closer coupling of applications and networks. Remos
implementations make measurements at network level when possible;
this strategy minimizes the measurement overhead and yields key
information for managing sharing of resources.

5.2  Other  models  and  extensions 

Several approaches to provide adaptivity without changes to the
programming model have been researched in the literature.
Nevertheless, it is interesting to note that these systems include
components that map directly into the concepts discussed in this
paper.

A  number  of  groups  have  looked  at  the  benefits  of  explicit 
feedback  to  simplify  and  speed  up  adaptation  (e.g.,  [10]). 
However,  the  interfaces  developed  by  these  efforts  have 
been  designed  specifically  for  the  scenarios  being  studied. 

The Quality Objects (QuO) system [27] provides adaptivity in the
context of object-oriented programming. To provide the feedback
between applications and environment, the QuO system includes system
condition objects that drive adaptivity either implicitly or
explicitly.

An adaptive system that provides a shared-memory programming model
for a network of workstations or PCs can take advantage of
additional nodes and also deal with withdrawal of nodes [17]. Here
the control of adaptivity rests with the (software) distributed
shared-memory system, but the application (or the compiler)
determines the points in the execution of the program where
adaptivity is possible.

6  Concluding  remarks 

Figure 1 gives examples of the four different kinds of couplings
between applications and networks that are discussed in this paper.
Network-aware applications today focus overwhelmingly on implicit
interaction. We argue that this state of affairs is due to the
current (lack of) support for other interaction models by network
architectures. As network architectures begin to provide information
that is more accurate, more timely, and more detailed, network-aware
applications will be motivated to also explore explicit interaction.

We show how the adaptivity options for distributed simulations fit
in the general framework of adaptive execution. The results
demonstrate that portable external mechanisms for network
measurements are necessary to support effective adaptive execution
of large distributed scientific applications.